DIFFERENTIAL RETENTION OF COURSE OUTCOMES
IN EDUCATIONAL PSYCHOLOGY 


WILLIAM P. MCDOUGALL 
Washington State College 


One of the paramount problems of all 
educational endeavor is that of making the 
learning experiences of students more 
lasting. Though the problem of retention 
has been studied in many different school 
subjects, relatively little research has been 
reported dealing directly with the per- 
manency of different kinds of course out- 
comes. The need for such evidence is 
suggested by the following quotation from 
the Taxonomy of Educational Objectives.


For the most part research on the prob- 
lems in retention, growth and transfer has 
not been very specific with respect to the 
particular behavior involved. Thus, we are 
not usually able to determine from this re- 
search whether one kind of behavior is re- 
tained for a longer period of time than 
another or which kinds of educative ex- 
periences are most efficient in producing a 
particular kind of behavior. Many claims 
have been made for different educational 
procedures, particularly in relation to per- 
manence of learning; but seldom have these 
been buttressed by research findings (2, p. 
23). 


It was the purpose of this study to 
measure retention of different course out- 
comes in a beginning course in educational 
psychology. The outcomes examined in- 
cluded: (a) knowledge and the intellec- 
tual abilities and skills, (b) translation, 
(c) interpretation, and (d) extrapolation. 
These objectives were defined by the 
Taxonomy of Educational Objectives (2), 
a handbook consisting of a logical and 
psychological classification of educational
goals. This handbook enables test
constructors to define very clearly the classes
of behavior being measured in that it 
provides extensive definitions together 
with examples of test situations measuring 
the various behavioral objectives. 


PROCEDURE 


The general plan of the study involved 
the construction of tests to measure a 
variety of educational objectives in a 
beginning course in educational psychology 
at the University of Nebraska. The course, 
Human Behavior and Development, is the 
second of a two-course sequence taken by 
teacher trainees. It encompasses primarily 
the content areas of learning and evalua- 
tion. For this study, the content con- 
sidered was delimited to the materials 
studied about tests and measurements in 
order to permit more intensive and uni- 
form sampling of the objectives. 

The tests were related to the course by 
using the course syllabus and accompany- 
ing references which were used by all 
instructors teaching the various sections 
of the course. For each of the objectives 
tested, a few examples of items patterned 
after the “Taxonomy” definitions follow: 


Knowledge 


1. Which of the following is most easily 
measured by a test: (a) problem-solving 
ability, (b) study skills, (c) factual infor- 
mation, (d) ability to comprehend. 

2. Which of the following is an individual 
intelligence test: (a) California Test of
Mental Maturity, (b) Stanford Binet, (c)
Ohio State Psychological Test, (d) Primary 
Mental Abilities. 

3. A test that places minor emphasis on 
the time limit is called a: (a) diagnostic 
test, (b) performance test, (c) survey test, 
(d) power test. 

4. Which of the following would be of 
most value in determining the typical be- 
havior of a student: (a) observation, (b) 
projective testing, (c) individual intelligence 
testing, (d) school achievement records. 

Item 1 is designed to measure knowledge 
of specific fact, Item 2, knowledge of a 
classification, Item 3, knowledge of terminology,
and Item 4, knowledge of methodology.


Translation


1. A major use of testing is for diagnosis. 
Which of the following test situations rep- 
resents the best example of the foregoing 
statement? (a) a comprehensive achieve- 
ment battery at the end of high school, (b) 
an achievement battery given early in the 
year, (c) an intelligence test, (d) a series 
of tests used to determine a student’s grade. 

2. If Bill scored at the 88th percentile in 
Social Service on the Kuder Preference 
Test, it would indicate that: (a) Bill got 
88% of the answers correct, (b) he has more 
ability in Social Service than 88% of his 
norm group, (c) only 12% of the norm group 
showed more interest in Social Service than 
he did, (d) that 88 out of 100 will do better 
than he did on this test. 

The first exercise involves translation of 
a formal statement by requiring the student 
to identify a concrete example. The second 
item involves the translation of quantitative 
data to its corresponding verbal meaning. 


Interpretation 


Data are given below on five pupils en- 
rolled in a class of 30 ninth graders. The 
test data are based on performance at the 
end of the first semester. Read over the 
summary and then show which pupil each 
statement best fits by marking the pupil’s 
number on the answer sheet. 

                                   Teacher's
                                   Estimate of Ach.
Calif. Ach. Test Performance       Rank in
Arith.     Read.     Lang.         Class
9.1        8.0       8.3           20
9.7        9.6       9.5           4
9.5        9.8       10.2          12
11.8       12.3      12.0          3
10.0       10.1      10.9          4


1. The pupil who should be doing con- 
siderably better in his school achievement. 

2. The accuracy of the IQ seems most 
doubtful in which case? 

3. A bright student making good use of 
his ability. 

4. Teacher regards abilities too highly 
according to test results.

5. Teacher’s rank most consistent with 
test scores. 

Each of the foregoing situations involves 
the ability to deal with a configuration of 
ideas or data recognizing the relationship 
and relative importance of each. The infer- 
ences or generalizations made from the data 
do not extend beyond the data but are con- 
fined to the material presented. 


Extrapolation 


The five students for whom the data are 
given below are in kindergarten. These test 
data are based on test performance at the 
beginning of the second semester. After 
examining the data, indicate which pupil 
best fits each of the following statements by 
marking the number of the student on the 
answer sheet. 


                              Percentile
              MA on           Rank on
              Stanford        Readiness
Student       Binet           Test

1      7+            72
2      644    54     22
3      5-5           64
4      5-8    5-6    45
5      5-6    6-10   38


Which student: 

1. Is apparently in need of stimulating 
experiences but has fairly high aptitude? 

2. Apparently comes from a very stim- 
ulating environment? 

3. Is most characteristic of the average 
for this group? 

4. Can you predict will have the lowest 
ability three years from this time? 

The first two situations require the stu- 
dent to extend the implications of the data 
to another topic or situation. The third 
situation requires extension from a sample 
to a universe. The last item involves time 
dimension and requires prediction on the 
basis of the data presented. 


The tests were then administered on a 
trial basis to a group of 75 educational 
psychology students who had completed 
units on tests and measurements. The
tests were then analyzed, refined, and used
as instruments to study the retention of 
the different course outcomes. The re- 
fined tests were given as a pretest, a test 
at the completion of the course, and a re- 
test approximately four months later. 
There were 301 students who took both 
the pretest and the test, and 172 of this 
group took the retest. This latter group 
was used in the study of retention. 

The appropriateness of the tests in- 
volved in the study was examined after 
the test was administered to the trial 
test group and again after the test had 
been revised.

The original trial tests contained ap- 
proximately 30 items measuring each ob- 
jective. The curricular validity of items 
was established by agreement among sev- 
eral instructors teaching the course in- 
volved in the study. Pooled judgment of 
several instructors was also used to assure 
that each item was correctly matched with 
the corresponding “Taxonomy” definition. 
The items were then studied after administration
to the trial test group. Item difficulty and item
discrimination were determined and substandard
items were dropped or revised. Evidence of
ambiguity in items and ineffective distractors
were also studied and many items were
revised or eliminated on this basis. The
resulting refined tests contained approx- 
imately 24 items each, and when combined 
required approximately one and one-half 
to two hours for administration. 

The tests were studied again at the time 
of the second testing in the retention ex- 
periment. At this time, 310 people took 
the test as part of their course final ex- 
amination. Item difficulty, item discrim- 
ination, and test reliability were deter- 
mined. Homogeneity of behavior measured 
by the different tests was studied in two 
ways: First, the correlations of the items
with their respective test totals were compared
with item correlations using the total of the
four tests combined as a criterion. Second, an
F test for departure from homogeneity proposed
by Neidt (10, p. 390) was applied. This latter technique
indicated whether or not there is a rela- 
tively greater lack of homogeneity be- 
tween or among areas than within areas 
measured. The semiexternal criterion of 
course marks was correlated with the 
scores to further establish test validity. 
The correlation coefficients between each 
test and a measure of scholastic aptitude, 
the L score on the American Council 
on Education Psychological Examination, 
were computed to determine the degree to
which verbal ability was present in each 
of these tests. 

In the study of retention, the suita- 
bility of the sample of 172 students who 
took the retest was determined by com- 
paring the performance of this group with 
the performance of the group who did 
not take the retest. The degree of relation- 
ship between the scores on each test ad- 
ministration was found by computing 
correlation coefficients between the pre- 
test and the test, the test and retest, and 
the pretest and retest. The differences 
between the means of scores on each of 
the test administrations were determined
and tested for significance. Retention was 
then studied by computing the average 
percentage of gain retained for each of the 
separate objectives measured. 


RESULTS


Analysis of the Tests 


The test item difficulty, reported in 
terms of percentage of the group who 
responded correctly to the item, was de- 
termined for all tests. The mean level of 
difficulty for the knowledge test was 
62.13%. The means for translation, inter- 
pretation, and extrapolation were 60.45, 
60.61, and 56.52, respectively. The indi- 
vidual difficulty percentages tended to 
cluster about the means and seemed to be 
well distributed with no items either being 
answered correctly or missed by 100% of 
the group.

Item discrimination was determined by
correlating each item with the total test 
score. To obtain these correlations, the 
upper and lower 27% of the distribution 
are designated as the criterion variable, 
and by entering the appropriate percent- 
ages in an item analysis table (4), the 
correlations may be estimated. Such cor- 
relations indicate the tendency for stu- 
dents who make high scores on the total 
test to mark the individual item correctly. 
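
The upper and lower 27% criterion groups described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the study estimated item-total correlations from an item analysis table (4), whereas this sketch computes the simpler difference between the proportions passing the item in the two groups. The function names and the example scores are hypothetical.

```python
def upper_lower_proportions(total_scores, item_correct, fraction=0.27):
    """Return (p_upper, p_lower): the proportion answering the item
    correctly in the top and bottom `fraction` of examinees, ranked
    by total test score."""
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have the same length")
    n = len(total_scores)
    k = max(1, round(n * fraction))  # size of each criterion group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_lower = sum(item_correct[i] for i in lower) / k
    p_upper = sum(item_correct[i] for i in upper) / k
    return p_upper, p_lower


def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper-minus-lower difference in proportion correct.
    This is a stand-in for the table-based correlation estimate."""
    p_upper, p_lower = upper_lower_proportions(
        total_scores, item_correct, fraction
    )
    return p_upper - p_lower


# Hypothetical data: ten examinees, one item scored 1 (right) or 0 (wrong).
scores = [5, 8, 12, 15, 18, 20, 22, 9, 14, 17]
item = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1]
print(discrimination_index(scores, item))
```

A positive index means that high scorers on the total test tend to mark the item correctly, which is the same tendency the item-total correlations express.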

On the combined tests, 54% of the total 
items were found to yield correlations of 
.40 or above. Twenty-nine per cent of the
total items were between .20 and .30. 
Only fifteen, or 17% of the total items, 
yielded correlations of less than .20. Two 
items were found to yield negative cor- 
relations and were eliminated from use in 
the retention study. The above percent- 
ages were fairly characteristic of all of 
the tests with slightly more low-correla- 
tion items in the knowledge and trans- 
lation tests than in the interpretation and 
extrapolation tests. 

The Spearman-Brown and the Kuder- 
Richardson estimates of reliability are 
shown in Table 1. 
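
The Spearman-Brown step can be checked with a few lines of Python. This is a minimal sketch, assuming the standard Spearman-Brown prophecy formula was applied to the odd-even (split-half) correlations; the function name is hypothetical.

```python
def spearman_brown(r_half, factor=2.0):
    """Estimate the reliability of a test lengthened by `factor`
    from the correlation between comparable parts. With factor=2
    this is the usual split-half correction."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


# Odd-even correlations reported for the knowledge and translation tests.
for name, r in [("Knowledge", 0.297), ("Translation", 0.290)]:
    print(name, round(spearman_brown(r), 3))
```

Under that assumption the two reported odd-even correlations, .297 and .290, give .458 and .450, matching the reported Spearman-Brown estimates for those tests.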

TABLE 1

SPEARMAN-BROWN AND KUDER-RICHARDSON ESTIMATES OF RELIABILITY

                  Odd-even      Spearman-Brown   Kuder-Richardson
Test              correlation   Estimate         Estimate
Knowledge         .297          .458             .495
Translation       .290          .450             .507
Interpretation    .477          .646             .531
Extrapolation     .440          [illegible]      .537

Apparently the small number of items
included in each test is the major reason
for the somewhat low reliabilities. For
evaluating the level of group accomplishment,
such reliabilities may be regarded
as acceptable according to some sources
(8, p. 609). Certainly higher reliabilities
would be more desirable, but in this
experiment the limiting factor of testing
time would have made it extremely difficult
to include more items in the tests.

One positive indication of homogeneity 
of the behavior measured by the different 
tests can be obtained by comparing the 
individual item correlations when using 
the respective test scores as a criterion 
with those obtained by using the total 
scores of the four tests combined as a cri- 
terion. Since the total score constitutes 
the criterion with which the item is com- 
pared, the higher the correlation the more 
the behavior measured by each item is 
like the behavior measured by the total 
test. It was noted that when all of the 
tests were combined into one single test 
score and the items correlated with this 
total, most of the correlations were re- 
duced. This reduction would indicate a 
greater heterogeneity of test content when 
tests were combined or, conversely, a 
greater homogeneity of content in the 
separate tests. It was not possible by this 
method, however, to determine the degree 
to which each test is homogeneous with 
respect to each other test. To test this 
hypothesis, an F test for departure from
homogeneity was applied. Intra- and
interarea correlations were obtained and
averaged according to the function
$\tfrac{1}{2}\log_e \frac{1+r}{1-r}$, as necessary for
substitution into the formula for computing
the F values, which is:

$$F = \frac{1 + \bar{r}_i - 2\bar{r}_e}{1 - \bar{r}_i}$$

where $\bar{r}_i$ is the average intra-area
coefficient and $\bar{r}_e$ is the average interarea
coefficient of correlation. The resulting
F values are shown in Table 2. Inspection 
of Table 2 shows that the resulting F 
values are significant beyond the 1% level 
of confidence between knowledge and 
translation, knowledge and interpretation,
and translation and extrapolation. The F
value for translation and interpretation 
is significant at the 5% level. The F values 
between knowledge and extrapolation and 
interpretation and extrapolation are not 
significant, although the first of these ap- 
proaches significance. The hypothesis that 
behaviors measured by the different tests 
were homogeneous with respect to each 
other can be rejected between all tests ex- 
cept knowledge and extrapolation and in- 
terpretation and extrapolation. 
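
The homogeneity computation can be sketched in Python. The sketch rests on two assumptions that the text does not fully confirm: that correlations are averaged through the transformation 1/2 log_e[(1 + r)/(1 - r)] and converted back to r, and that the F ratio has the form (1 + intra - 2 x inter)/(1 - intra), which equals 1 when the intra-area and interarea averages are equal. The function names and example correlations are hypothetical.

```python
import math


def fisher_z(r):
    """Transform a correlation with 1/2 * log_e((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def average_correlation(correlations):
    """Average correlations in the z metric, then convert back to r."""
    zs = [fisher_z(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))  # tanh is the inverse transform


def homogeneity_f(r_intra, r_inter):
    """Assumed form of the F ratio for departure from homogeneity."""
    return (1.0 + r_intra - 2.0 * r_inter) / (1.0 - r_intra)


# Hypothetical item intercorrelations within areas and between areas.
intra = average_correlation([0.28, 0.30, 0.32])
inter = average_correlation([0.08, 0.10, 0.12])
f_value = homogeneity_f(intra, inter)
print(round(f_value, 3), f_value > 1.33)  # 1.33 is the 1% value cited
```

The larger the intra-area average is relative to the interarea average, the further F rises above 1, which is the sense in which a significant F indicates that two tests measure different behaviors.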

On the basis of these results the inter- 
pretation and extrapolation tests were 
combined since they did not seem to be 
performing separate functions. The re- 
sulting F values with these two tests com- 
bined are shown in Table 3. 

Inspection of Table 3 reveals that all 
values of F are significant beyond the 1% 
level of confidence. The hypothesis that 
these three tests measure behaviors homo- 
geneous with respect to each other is 
rejected. The remainder of the experi- 
ment considered interpretation and ex- 
trapolation as a single test. The resulting 
Spearman-Brown estimate of reliability 
for this test would become .773. 

A semiexternal criterion, namely final 
course marks, was employed to obtain a 
measure of empirical validity. The result- 
ing correlations between the tests and 
final grades centered about .60, demon- 
strating a high positive relationship using 
such a criterion. These correlations would 
be spurious to the extent that the tests 
used in the experiment constituted as much
as one-sixth of the final grade. 

The correlations between the L score of 
the American Council on Education Psy- 
chological Examination and each test are 
as follows: 

Test                                 r
Knowledge                            .364
Translation                          .362
Interpretation-Extrapolation         .343

It is evident from the inspection of these
coefficients that the influence of the
scholastic aptitude factor as measured by the
L score is equally present in the performance
required by the different tests.


TABLE 2

VALUES OF F FOR TESTS OF HOMOGENEITY BETWEEN TESTS

                  Translation   Interpretation   Extrapolation
Knowledge         1.388         1.332            1.178
Translation                     1.310            1.423
Interpretation                                   1.005

Note. Required for significance, 309 and 309 degrees
of freedom: 1% = 1.33; 5% = 1.22.


TABLE 3

VALUES OF F FOR TESTS OF HOMOGENEITY BETWEEN TESTS
(INTERPRETATION-EXTRAPOLATION COMBINED)

                                Interpretation-
                  Translation   Extrapolation
Knowledge         1.388         [illegible]
Translation                     1.489

Note. Required for significance, 309 and 309 degrees
of freedom: 1% = 1.33; 5% = 1.22.


The Study of Retention 


The differences on scores of the 172 
students who took the retest and those 
who did not were determined and tests of 
significance applied. It was established 
that this sample was characteristic of 
the population of 301 from which it was 
taken. 

The possibility that subject matter 
learned in other courses might transfer 
was also considered. It was discovered 
that none of the students who participated 
took courses during the retention period 
that dealt systematically with the area of 
tests and measurements. Apparently it 
was safe to conclude that only incidental
amounts of transfer, if any, would be expected.

The relationship between the three ad- 
ministrations of each test was studied. 
The resulting correlation coefficients were 
positive in all cases but not to a high 
degree. The values ranged from .274 to .599
and averaged about .44. Such correlations
indicated that individuals tended
to maintain their relative rank on the 
successive test administrations. The cor- 
relations were slightly higher for the inter- 
pretation-extrapolation test than for the 
others, with an average correlation of 
.542.

To determine if the differences in mean 
performance on the various test adminis- 
trations were significant, a t test for 
correlated data was applied. In Table 4 
the differences, together with the ac- 
companying t values, are shown. 
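
The t test for correlated data can be sketched as follows. This is a minimal illustration assuming the usual paired-differences form of the test; the function name and the four pairs of scores are hypothetical and are not data from the study.

```python
import math
from statistics import mean, stdev


def correlated_t(first, second):
    """t test for correlated (paired) data.

    Returns (t, degrees_of_freedom) for the mean of second - first."""
    if len(first) != len(second):
        raise ValueError("each student needs a score on both administrations")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses n - 1
    return mean(diffs) / standard_error, n - 1


# Hypothetical pretest and test scores for four students.
pretest = [10, 12, 9, 11]
test = [13, 14, 12, 15]
t_value, df = correlated_t(pretest, test)
print(round(t_value, 3), df)
```

Because each student serves as his own control, the test works on the differences between paired scores rather than on the two sets of scores separately.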

It may be noted from inspection of 
Table 4 that all of these differences are 
significant at the 1% level except the 
difference between the pretest and retest
for the interpretation-extrapolation test.
This difference is significant at the 2% 
level. These results show that, on the aver- 
age, a significant amount of material was 
learned during the instruction period, a 
significant amount forgotten during the 
four-month retention period and at the 
end of the retention period the students 
still retained enough learning so that their 
performance was significantly different 
from that at the time of the pretest. 
The amount of material retained for 
each test may also be reported in terms 
of percentage of gain retained. These 
percentages are also reported in Table 4. 
To determine if the differences between 
the percentages were significant, a t test 
for correlated data was applied. The dif- 
ference of .78% between knowledge and 
translation yielded a t value of .222 which
is not significant. The percentage differ- 
ence between knowledge and interpreta- 
tion-extrapolation was 6.55 with an ac- 
companying t of 1.985 which is significant 
at the 5% level. The difference of 5.77%
between translation and interpretation-extrapolation
yielded a t of 1.748 which is significant at the
10% level of confidence.


TABLE 4

MEAN SCORES, DIFFERENCES BETWEEN MEAN SCORES WITH ACCOMPANYING t VALUES,
AND PERCENTAGE OF GAIN RETAINED FOR THE THREE ADMINISTRATIONS OF THE TESTS

                                                            Interpretation-
                                 Knowledge    Translation   Extrapolation
                                 (N = 172)    (N = 172)     (N = 172)
Pretest Mean (Mp)                11.90        11.05         [illegible]
Test Mean (Mt)                   15.55        13.83         [illegible]
Retest Mean (Mr)                 14.55        13.09         [illegible]
Mt - Mp (Gain)                   3.65         2.78          [illegible]
  t                              10.42        13.29         [illegible]
Mt - Mr (Loss)                   1.00         .74           [illegible]
  t                              3.33         2.74          [illegible]
Mr - Mp (Gain Retained)          2.65         2.04          [illegible]
  t                              10.19        8.16          [illegible]
(Mr - Mp)/(Mt - Mp)
  (Percent of Gain Retained)     72.60        73.38         [illegible]

Note. Required for significance, 171 degrees of freedom:
1% = 2.58; 2% = 2.32.

The course of learning and retention for 
each behavior studied may also be ex- 
pressed in terms of percentage of items 
answered correctly at each testing. 

The greatest gain and relatively the 
greatest loss were made on the knowledge 
test, the percentage of items correct in- 
creasing during the course from 49.6% 
to 64.8% and dropping off to 60.6%. The 
average percentage of items correct for 
translation began with 50.2%, increased to 
62.4% and dropped to 59.5%. The cor- 
responding percentages for interpretation- 
extrapolation were 49.2%, 60.2%, and 
57.9%. 
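
The percentage of gain retained can be recomputed from three means with a short Python sketch. It assumes the ratio (retest - pretest)/(test - pretest), and it assumes that the means 11.90, 15.55, 14.55 and 11.05, 13.83, 13.09 in Table 4 belong to the knowledge and translation tests respectively; the function name is hypothetical.

```python
def percent_gain_retained(pretest, test, retest):
    """Share of the pretest-to-test gain still present at the retest,
    expressed as a percentage."""
    gain = test - pretest
    if gain == 0:
        raise ValueError("no gain between pretest and test")
    return 100.0 * (retest - pretest) / gain


# Mean raw scores (assumed assignment to tests, see above).
print(round(percent_gain_retained(11.90, 15.55, 14.55), 2))  # knowledge
print(round(percent_gain_retained(11.05, 13.83, 13.09), 2))  # translation

# Percentages of items correct for interpretation-extrapolation.
print(round(percent_gain_retained(49.2, 60.2, 57.9), 1))
```

Under these assumptions the first two ratios come to 72.60 and 73.38, the two percentages legible in Table 4, and the rounded percentages of items correct for interpretation-extrapolation give roughly 79.1, in line with the reported differences of 6.55 and 5.77 percentage points.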


DISCUSSION


The results of this study indicate the 
need for carefully delineated course ob- 
jectives. The homogeneity analysis in this 
experiment showed that tests constructed
to measure certain behavioral outcomes
apparently perform separate functions as 
evaluation devices. Thus, to insure that 
multiple course outcomes, in line with the 
objectives of instruction, are achieved, it 
becomes necessary to design evaluation 
instruments to accomplish these separate 
functions. These results tend to agree with 
the results of previous studies done by 
Tyler (13), McConnell (9), Johnson (7), 
Brown (3), Horrocks (6), and Bedell (1), 
all of which make it apparent that the 
achievement of one objective cannot be 
inferred from the achievement of another. 
Remmers has expressed this point when 
he concluded (11, p. 31): “... the edu- 
cator must clearly define each objective 
in terms of the measure of its attainment. 
The attainment of a particular objective 
cannot be inferred from measured attain- 
ment of another objective.” 

The majority of studies reported in the 
literature suggest much of what is learned
in school is forgotten. It has long been the
concern of educators to provide learning 
experiences of more permanent value. 
From this standpoint, the results of this 
investigation suggest that increased em- 
phasis on some of the higher levels of 
understanding such as interpretation and 
extrapolation will lead to more economical 
learning. As the authors have defined the 
objectives in the “Taxonomy,” each higher 
level of intellectual ability is built on and 
includes the previous levels. To emphasize 
such abilities as interpretation and ex- 
trapolation means that the possession of 
knowledge and the ability to translate 
it will be a part of the learning experience, 
but that understanding will go beyond 
these lower levels of intellectual endeavor 
and involve mastering more permanent 
abilities and skills. Such practices have 
not always been the case. Tyler (13) 
found that interviews with college stu- 
dents indicated that more than 60% of 
the students in college believe their chief 
duty is to memorize information. Tyler 
stated that the emphasis given to recall 
of fact in the typical college examination 
is one of the chief reasons for the exis- 
tence of this belief. 

It has been previously shown in studies 
done by Tyler (12), Wert (14), and 
Frutchey (5) that such outcomes as the 
ability to apply principles to new situa- 
tions and interpret new experiments dem- 
onstrated much higher degrees of per- 
manency than abilities involving only the 
recall of specifics. The results of the pres- 
ent experiment agree in general with what 
has been previously done. They also suggest
that such a device as the “Taxonomy” 
will enable us to do a far more systematic 
and communicable job in studying dif- 
ferent outcomes of instruction. 


SUMMARY


The purpose of this experiment was to 
study the differential retention of certain
course outcomes in a beginning educational
psychology course.

Tests were constructed to measure four 
different behavioral outcomes in the con- 
tent area of tests and measurements. These 
outcomes were: (a) knowledge, (b) translation,
(c) interpretation, and (d) extrapolation.
As a result of a homogeneity
analysis of the behaviors measured by 
these different tests, it was found de- 
sirable to combine the interpretation and 
extrapolation tests in that they seem to 
be performing a similar measurement func- 
tion. 

The tests were administered as a pre- 
test before the units on tests and measure- 
ments were studied, at the completion of 
the units, and a third time after approxi- 
mately four months had elapsed. The re- 
sults of the study of retention indicated 
that the abilities to interpret and extrapo- 
late were retained to a significantly greater 
degree than the ability to recall knowledge 
or translate this knowledge from one form 
to another. It was concluded that there 
was differential retention among the behavioral
objectives measured.


ATTITUDES TOWARD SCHOOL OF HIGH SCHOOL PUPILS
FROM THREE INCOME LEVELS 


JOHN K. COSTER 
Department of Education, Purdue University 


Educators have become increasingly 
concerned with the relationship between 
social status and educational outcomes. 
Numerous studies of this relationship have 
been reported. The findings have demon- 
strated that social status is related to 
practically all educational experiences. 

The relationship between social status 
and intelligence and achievement has been 
emphasized in these studies (7, 14). Other 
studies have involved personality (2, 6), 
extracurricular activities (9, 15), social 
acceptance (1, 11), honors received (1), 
attitudes (8, 10), and morale (3). Re- 
sults usually indicate that upper status 
pupils exceed lower status pupils on 
achievement and intelligence test scores, 
marks in school, adequacy of adjustment, 
number of activities, social relationships, 
and attitudes toward school. The differ- 
ences are generally statistically significant. 

In the present study, the relationship 
between specific attitudes toward school 
and level of income was investigated. The 
purpose was to ascertain on which of a 
number of attitudinal items pupils varied 
in their responses when they were divided 
into three income groups. 


PROCEDURE 


A questionnaire, containing a morale 
scale and a “house and home” scale, was 
administered to approximately 3,000 pu- 
pils in nine central and south central 
Indiana high schools. The morale scale, 
constructed as part of another study (3), 
contained 27 attitudinal items. The items 
pertained to the school, teachers, school 
program, other pupils, and the value of 
education. Each item in the scale was 
stated as a question, and was followed by
a list of five possible responses. The responses
reflected (a) a very favorable attitude, (b) a
favorable attitude, (c) a neutral (neither
favorable nor unfavorable) attitude, (d) an
unfavorable attitude, and (e) a very unfavorable
attitude.
Following is an example of a typical item 
and list of responses. 

Item: What is your general opinion of 
the other boys and girls in your high 
school? 

a. They are the best group of boys
and girls in the world!

b. I feel that we have a good group
of boys and girls in our high school.

c. Some of the other students are
all right; some are not.

d. I feel that this high school has
a poor group of boys and girls.

e. They are the worst group of
boys and girls in the world!

Pupils were instructed to check the responses
with which they agreed most closely.

An indication of income level was ob- 
tained from a “house and home” scale. 
This scale listed seven things either found
in the home or provided for the pupil. 
The items were a vacuum cleaner; an 
electric or gas refrigerator; a bath tub or 
shower with running water; two automobiles
(excluding trucks); lessons in drama,
art, expression, dancing, or music pro- 
vided outside of school; an automatic 
dishwasher; and a cabin or cottage for 
vacations. Pupils checked the items which 
applied to them. 

The “house and home” scale has been 
used extensively by Remmers and others 
(12) in the Purdue Opinion Panel studies
to divide pupils into income groups (e.g.,
10). Elias (5) and Remmers and Kirk
(13) have reported on the validity of the
scale.


TABLE 1

NUMBER AND PERCENTAGE OF PUPILS WHO CHECKED ITEMS ON HOUSE AND HOME SCALE,
AND NUMBER AND PERCENTAGE OF PUPILS IN EACH INCOME GROUP

Columns: Number of items checked (N, %); Income Group (N, %)

[Table body not recoverable; surviving entries: 270, 83, and a total of 878 (100.0%).]

A sample of 878 cases was selected from 
the returns. The sample included 100 
questionnaires, selected randomly, from 
the six larger schools, and all useable 
questionnaires from the three smaller 
schools. 

Based on the number of items checked 
on the “house and home” scale, Ss were 
divided into three income groups. The 
high income group included Ss who 
checked five to seven items. The middle 
income group included those who checked 
three or four items. And the low income 
group included those who checked two 
or fewer items. The number and per- 
centage of pupils who checked each num- 
ber category, and the number and per- 
centage of pupils in each income group are 
shown in Table 1. 
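
The grouping rule can be written out as a small Python function. This is a sketch of the cutoffs stated above (five to seven items, three or four items, two or fewer items); the function name and the group labels are hypothetical.

```python
def income_group(items_checked):
    """Assign an income group from the number of "house and home"
    items checked (the scale lists seven items)."""
    if not 0 <= items_checked <= 7:
        raise ValueError("the scale lists seven items")
    if items_checked >= 5:
        return "high"
    if items_checked >= 3:
        return "middle"
    return "low"


# Tally a few hypothetical questionnaires by group.
counts = {"low": 0, "middle": 0, "high": 0}
for checked in [0, 2, 3, 4, 4, 5, 7]:
    counts[income_group(checked)] += 1
print(counts)
```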

Responses to each attitudinal item were 
tabulated by income group. The responses 
were then combined into two categories. 
One group included favorable and very 
favorable responses. The second group in- 
cluded all other responses. 

For each item, the following null hy- 
pothesis was postulated: There is no dif- 
ference in the responses of pupils of varied 
income groups. Each of the 27 hypotheses 
was tested by the chi-square technique. 
The tests were based on a series of 2 × 3
contingency tables. The combination of
responses provided a uniform series of 
tables, with a minimum expected frequency 
of five in each cell. 
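
The chi-square computation for one such 2 × 3 table can be sketched in Python. The counts below are hypothetical and are not data from the study; the function name is also hypothetical. The critical values in the comments, 5.991 and 9.210, are the standard 5% and 1% points of chi-square with 2 degrees of freedom.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows.

    Returns (statistic, degrees_of_freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: favorable or very favorable responses, all other responses.
# Columns: low, middle, and high income groups (hypothetical counts).
observed = [[60, 70, 80],
            [40, 30, 20]]
stat, df = chi_square(observed)
print(round(stat, 3), df)
# Compare with 5.991 (5% level) and 9.210 (1% level) for df = 2.
```

Collapsing the five response categories into two keeps every table the same shape, so each of the 27 tests has 2 degrees of freedom and the same critical values apply throughout.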


RESULTS AND DISCUSSION


The results of the chi-square tests are 
given in Table 2. The table also shows 
the percentage of pupils, by income group, 
who checked favorable and very favorable 
responses. The item-questions were 
abridged to conserve space in the table. The 
column headed “P” indicates the probabil- 
ity level associated with chi-square values. 

The items were divided into seven 
groups to facilitate interpretation: (A) 
Attitudes Toward Teachers, (B) Attitudes 
Toward the School, (C) Attitudes Toward 
School Program, (D) Attitudes Toward 
Appropriateness of School Work, (E) 
Attitudes Related to Future Expectations, 
(F) Attitudes Related to Social Accep- 
tance, and (G) Miscellaneous Attitudes. 
The letters are used in designating the 
items in Table 2. 

The data show that responses varied
significantly on relatively few items. Only
eight of the 27 hypotheses could be rejected,
six at the 1% level and two at
the 5% level. The responses among groups
differed widely, ranging from practically
no variation to extremely significant variations.
The frequencies generally varied
most on items which involved interpersonal
relationships. And they appeared
to vary least on items which the pupils
could consider objectively, with limited
emotional attachment.


TABLE 2

RESULTS OF TESTS OF SIGNIFICANCE SHOWING PERCENTAGE OF PUPILS CHECKING VERY
FAVORABLE AND FAVORABLE RESPONSES, BY INCOME GROUPS

[The percentage, chi-square, and P columns are illegible in this copy; only the items are listed.]

A-1 | What is your opinion of your high school teachers?
A-2 | Do your teachers treat you fairly?
A-3 | Are your teachers personally interested in you?
A-4 | Do your teachers "know" and understand their subjects?
A-5 | How well are your subjects taught?
A-6 | Do your teachers help you sufficiently with your school work?
A-7 | Would you ask adults in your school for help with personal problems?
B-1 | What is your general opinion of your high school?
B-2 | How well is your school organized?
B-3 | How satisfactory are the working and studying conditions?
B-4 | How satisfactory are the equipment and facilities?
B-5 | How satisfactory is the grading system?
B-6 | What is your opinion of the school spirit in your school?
C-1 | What is your opinion of the group of subjects your school offers?
C-2 | What is your opinion of the number of activities in your school?
D-1 | Is your school work the kind of work you like to do?
D-2 | Is your school work interesting?
E-1 | Will your school work be useful after you leave school?
E-2 | Will going to high school help you get more satisfaction from living?
E-3 | What are your chances of getting the job you want after high school?
F-1 | Are you satisfied with your social life in high school?
F-2 | Do the other students like you?
F-3 | How well do other people in your school treat you?
F-4 | What is your opinion of the other boys and girls in your school?
G-1 | Are your parents interested in your high school work?
G-2 | How do people in your community feel about your high school?
G-3 | How hard are you working or studying in high school?

Significant variations were noted for 
three of the four items in the social 
acceptance group (Group F). Low in- 
come pupils reacted less favorably than 
other pupils to their social life (Item 
F-1), to being liked by other pupils (F-2), 
and to other pupils (F-4). According to 
unpublished data (4), high and middle 
income pupils significantly exceeded others 
in the percentage who associate with fel- 
low pupils outside of school. Low income 
pupils were more likely to associate with 
youth from other schools or not in school. 

The attitudes on social acceptance 
seemed to be related to other attitudes on 
which differences in responses were ob- 
served. Low income pupils apparently are 
not as sure of parental interest in school 
work as other pupils (G-1). Whereas prac- 
tically all high income pupils indicated 
that they felt that their parents were in- 
terested in their work, only three fourths 
of the low income pupils expressed similar 
opinions. These differences were highly sig- 
nificant, with P < .001. 

Low income pupils also differed from 
other pupils in their estimates of the per- 
sonal interest of their teachers (A-3). The 
differences were significant at the 5% level. 
This item was the only one of the seven 
items pertaining to teachers on which re- 
sponses varied significantly. 

The item on general impression of the 
high school (B-1) was the only item re- 
lated to teachers, school, school program, 
appropriateness of work, and value of edu- 
cation on which differences were significant 
at the 1% level. In view of the homoge- 
neity of other items, it would seem that 
responses to this item were affected more 
by the nature of relationships than by the 
nature of the school and school program. 


Responses varied widely to the item on 
future employment (E-3). Over two thirds 
of the high income pupils, as compared 
with less than one half of the low income 
group, expressed favorable responses about 
getting the kind of job they want. Differ- 
ences were significant at the 0.1% level. 
This item may be related to post high 
school educational aspirations. It was 
found that one half of the high income 
pupils in the sample plan to go to college, 
as compared with less than one sixth of 
the low income group (4). 

Except for the two items mentioned pre- 
viously, responses to items on teachers 
(Group A) and school (Group B) varied 
slightly or not at all. Pupils in the three 
income groups were virtually in complete 
agreement on items related to the tech- 
nical operation of the school and the tech- 
nical competency of teachers. 

The responses to items on school pro- 
gram (Group C), appropriateness of 
school work (Group D), and the value of 
education (E-1 and E-2) generally varied 
more than the responses to items on 
teachers and the school. The responses of 
high income pupils were more favorable 
for five of the six items in these groups, 
but, except for item D-2, variations were 
not significant. High income pupils were 
more interested in their school work than 
others (D-2), and differences were signifi- 
cant at the 5% level. The Ss responded 
uniformly to the number of activities in 
the school (C-2), even though low income 
pupils participated in significantly fewer 
activities (4). The pupils in all groups re- 
acted favorably to the value of education. 
The low income pupils, however, reacted 
more favorably to the utility value of edu- 
cation (E-1) than to the enrichment value 
(E-2). 

The low income pupils differed from 
others—but not significantly—on estimates 
of how people in their communities felt 
about their high schools (G-2). And on the 
question of how hard pupils were working
in high school (G-3), no variation among 
groups was observed. 


CONCLUSIONS 


The data seem to support the following 
conclusions: 

1. Responses of pupils of different in- 
come levels were more likely to vary on 
items related to interpersonal relation- 
ships than on items which involved an ob- 
jective appraisal of the school or the 
school program. 

2. The schools in the study have pro- 
vided an educational program uniformly 
accepted by pupils of the three income 
levels. They have been less successful in 
integrating all pupils into the social struc- 
ture of the school. How acceptance may be 
gained for all pupils is undoubtedly a 
perennial problem. 

3. The low income pupil is less likely to 
enjoy strong parental interest and support 
than other pupils. An immediate, prac- 
tical problem confronting the schools, 
therefore, is stimulating interest of all 
parents in school and school work. 

4. Variations in estimates of possible 
satisfactory future employment among 
pupils of varied income levels suggest that 
more attention should be given to helping 
noncollege, low income pupils select, pre- 
pare for, and enter an appropriate voca- 
tion. 


SUMMARY


When 878 pupils from nine Indiana 
high schools were divided into three in- 
come groups, it was found that they re- 
sponded similarly to attitudinal items on 
school, school personnel, school program, 
and the value of an education. The re- 
sponses varied significantly with income 
level, however, on items related to inter- 
personal relationships. The items on which 
differences were observed pertained to so- 
cial life, being liked by other pupils, opin- 
ions of other pupils, feelings of parental 
interest in school work, and personal 
interest of teachers. Although pupils re- 
sponded uniformly on specific items per- 
taining to the school, they varied signifi- 
cantly, according to income level, in their 
general impression of their schools. They 
also varied significantly in their estimates 
of being able to get the kind of jobs they 
want after they leave school. 
The College Entrance Examination
Board undertook in 1951 a study to ex- 
plore the effectiveness of a series of new 
aptitude tests that might prove to be con- 
tributive supplements to the Scholastic 
Aptitude Test or effective substitutes for 
parts of it. 

Validity studies of the SAT itself are 
carried out routinely. Substantially all of 
these use as criteria the grades received 
during freshman year. This has been done 
mainly because of the great delay encoun- 
tered in waiting for the longer-term cri- 
teria. Furthermore, while students take a 
considerable variety of courses in fresh- 
man year, their freshman programs are 
much more alike than their upperclass 
programs. For this reason “average fresh- 
man grades” may be not only more 
quickly available but also more meaning- 
ful than average grades received when the 
students are working in different subject- 
matter areas having different degrees of 
difficulty. 

It is useful to consider some hypotheses 
for the change that might occur in the 
validity of an aptitude test between fresh- 
man and upperclass years in college. The 
following conditions should lead to a de- 
crease in the validity of aptitude tests: 


1. Seniors take more varied courses than 
freshmen; success in the various courses, 
some easy and some difficult, will be hard 
to predict. 

2. Time between testing and the measure- 
ment of the criterion allows more scope 
for changes to take place in the individual 
students as a result of different experiences 
or different rates of maturation. 

3. Attrition at college cuts down the 
range of ability between freshman and 
senior years. 


Conditions possibly leading to an increase 
in validities are as follows: 
1. A lack of adequate adjustment to college
life in freshman year might introduce
extraneous influences on scholastic success. 

2. More uniformly high motivation and
a more serious attitude toward work in upperclass
years may cut down one source of
extraneous variance.

3. Emphasis on memory work in fresh- 
man year may depend on motivation or 
other factors, while the understanding and 
problem solving required in upperclass 
years may depend more upon the aptitudes 
measured by most test scores. 


It is not easy to guess at the resultant of 
such factors as these. 

There have been very few studies where 
validities of High School Record and of 
College Board and other tests for fresh- 
man grades have been compared with 
validities for four-year grades. Studies by 
Dwyer (2), Brush (1), and Frederiksen 
(3) have shown, in general, that four- 
year cumulative average validities do not 
differ consistently from freshman validi- 
ties. Findings in the present study gen- 
erally confirm this conclusion. 

Even less has been done in validating 
the SAT against major field grades. An 
unpublished study carried out at Stan- 
ford University (6) shows validities of 
the SAT and high school record for cumu- 
lative average and major-field grades. The 
superiority of the high school record in 
that study and the sex differences found 
for the validity of the SAT for Social 
Science grades are not confirmed by find- 
ings in the present study. 


THE EXPERIMENTAL TESTS


In addition to the High School Record, 
SAT-V (verbal), SAT-M (mathematical), 
and the CEEB English Composition Test, 
the measures investigated in this study 
consisted of 11 newly adapted or newly 
developed aptitude tests, some with part
scores. Descriptions of the tests and re- 
liabilities by the Kuder-Richardson for- 
mula No. 20 are as follows: 


1. Social Studies Reading. A 1,000-word 
passage by Hamilton concerning the Bill of 
Rights with questions on interpretation, vo- 
cabulary in context, and the structure of 
the passage. 25 four-choice items, 25 min- 
utes. Reliability, .66. 

2. Science Reading. A 1,200-word essay 
on “A Piece of Chalk” by Huxley with 
questions on interpretation, vocabulary in 
context, and the structure of the passage. 
25 four-choice items, 25 minutes. Relia- 
bility, .71. 

3. Inductive Reasoning. A spiral omni- 
bus test using items drawn from verbal, 
nonverbal, arithmetic, science, and social 
studies materials. The items were of three 
types found to measure inductive reasoning: 
analogies, series, and categories or “belong- 
ing” items. The great variety of item types 
was introduced so that the subjects could 
not develop a uniform approach or uniform 
method of solution, which would render the 
test deductive instead of inductive. 65 items, 
25 minutes. Reliability, .82.

4. Integration. This test was similar to 
conventional “artificial language” tests ex- 
cept that the rules for translation were more 
complex, and there was no premium on 
quick memory. It was developed as a test 
of one of the factors called “integration” 
in the Army Air Force Aviation Psychology 
Program (4). This is the ability to under- 
stand and follow complex directions. 15 
items, 25 minutes. Reliability, .73. 

5. Sufficiency of Data. Each problem 
consisted of a question followed by two 
mathematical or quantitative facts. The 
task was to decide whether either fact, both 
together, both separately, or none were suf- 
ficient to answer the question. 30 problems, 
25 minutes. Reliability, .80.

6. Data Interpretation. This test con- 
sisted of statements related to the content 
of two sets of data: a table on the expendi- 
tures of state governments and a verbal 
exposition of a research concerning enlarge- 
ment of the thyroid gland. The task was to 
decide whether the data were sufficient to 
make each statement true, probably true, 
false, probably false, or none of these. 30 
items, 25 minutes. Reliability, .68. 

7. Visualization. Drawings indicated how 
a square sheet of paper was folded and then 
punched one or two times. The task was to 
select from five drawings the one that 
showed how the paper would look when 
opened. 20 items, 25 minutes. Reliability, 
.85.

8. Best Arguments. Situations involving 
some sort of dispute were described in a 
paragraph. Subjects selected one or two statements
constituting the best argument for
each side. Four situations totalling 21 items, 
25 minutes. Reliability, very low. (K.R. 20 
was not applicable, because the items were 
not independent from each other.) 

9. Perceptual Speed and Carefulness. The 
two parts of this test each contributed to 
the measurement of Perceptual Speed and 
Carefulness. (a) Cancellation. A page of 
random capital letters typed close, lines 
single spaced, and reproduced in red. The 
task was to draw an X over every A. Three 
minutes were allowed. (b) Picture Discrimi- 
nation. Each item consisted of three simple 
drawings of a face, two exactly alike, and 
one different in some respect. Three minutes 
were allowed. The score for Perceptual 
Speed was the number of A’s cancelled plus 
the number of faces correctly marked. Reliability,
.94. The Carefulness score was the
inverse of a score developed by adding 
omissions on Cancellation to five times the 
wrongs on Picture Discrimination. (This 
scoring formula operated to weight the two 
parts equally in the total score.) Reliability, 
.55.


10. Memory. This test had 3 parts scored 
as separate variables for validation pur- 
poses: (a) Picture Memory. A picture of a 
Venetian palace was studied for five min- 
utes. Later a second picture was presented 
showing the same palace with some features 
changed. The students were allowed five 
minutes to answer 30 true-false questions 
comparing the pictures. (b) Verbal Memory. 
A one-page description of the peoples of 
Honduras was studied for five minutes. 
Later the students were allowed five min- 
utes to answer 30 true-false questions about 
the passage. (c) Number Memory. Some 
prices and inventory numbers in department 
stores were studied for five minutes. Later 
the students were allowed five minutes to 
answer 15 five-choice questions calling for 
recognition of the memorized numbers. The 
memory portion of each of these parts took 
place during the initial 15 minutes of the 
two-hour testing session, and the response 
portions took place during the last 15 min- 
utes with one and a half hours of other testing 
coming in between. Reliabilities for the
three parts were respectively .74, .54, and
.69.

11. General Information. This test pre- 
sented five-choice factual information items 
drawn from various fields with the intention 
of measuring interest in those fields. The 
items were selected so as to avoid informa- 
tion that would be acquired in school, but 
to include information that would be gained 
through hobby work or incidental reading, 
presumably of the student’s own choosing. 
The scores (number of items answered cor- 
rectly out of the 15 for each of seven fields) 
were treated as separate variables in the 
validation study. The fields included were: 
(a) Art and Architecture, (b) Literature, (c)
Social Work, (d) Government, (e) Biological
Science, (f) Physical Science, and (g) Mechanical.
Total items 105; total time 40
minutes. Reliabilities for the parts were respectively
.53, .52, .30, .52, .52, .56, and .52.
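
The reliabilities quoted in the list above were obtained with Kuder-Richardson formula No. 20. As an illustration of how that coefficient is computed from right-wrong item scores, the sketch below may help. The data are hypothetical, the function name `kr20` is an assumption, and the total-score variance is taken with N in the denominator, a detail the article does not specify.

```python
def kr20(responses):
    """KR-20 for a list of examinees, each a list of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two examinees and two items")
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    # Variance of total scores (divide by N; an assumption of this sketch).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical scores: five examinees, four items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
reliability = kr20(scores)
```

As the note under Best Arguments indicates, the formula presupposes independently scored items, so it would not be meaningful for that test.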


ADMINISTRATION OF THE TESTS 


Ten liberal arts colleges, all of which 
require the SAT for entrance, participated 
in the study by scheduling two hours of 
testing for their entering freshmen and 
by supplying all course grades and some- 
times the high school record. Four tests 
were administered at each of the colleges 
by combinations taken so as to provide a 
substantial number of cases for the most 
interesting of the intercorrelations. Ex- 
cept for Perceptual Speed and General 
Information, the tests were 25 minutes in 
length, a half hour including administra- 
tion time. Perceptual Speed and General 
Information were always given together 
as they formed a one-hour unit. Other- 
wise, the order of administration of the 
tests was varied from college to college. 
All administrations took place in the fall 
of 1951. 


THE CRITERIA

The data used were found on the tran- 
scripts of the students’ college records or 
on supplementary material provided by 
the colleges. Descriptions of the criteria 
follow. 

Cumulative college average. This was 
the over-all college grade-point average. 
Many different marking systems were 
represented, but it was not necessary to 
convert all of these to a common scale, 
because separate correlation studies were 
undertaken for each college. The Cumu- 
lative Average was computed for all stu- 
dents who had completed at least a half 
year of work. Inclusion of students who 
did not finish college introduces an im- 
purity into this criterion, because the 
grades are not earned in all years of col- 
lege on the same basis. For example, it is 
somewhat easier to earn high grades in 
senior year than it is in freshman year. 
However, to have failed to include non- 
graduates in the cumulative average would 
have sharply reduced the number of cases 
in the study and might have eliminated 
the part of the range of test scores and 
grades that is of most interest to admis- 
sions officers. 

Freshman grades. The freshman grade 
average was computed in the same way as 
the cumulative average. Freshman grade 
averages in specific course areas were also 
computed. 

Major-field grades. The major-field 
grade was computed from the grades in 
the major-field courses taken at the par- 
ticipating college during junior and senior 
years. This criterion was computed for 
graduating students only. To simplify the 
tables given in this article and to increase 
the stability of the figures, the major 
fields were grouped into three groups: 
science and mathematics, social science, 
and humanities and languages. In all cases 
validity coefficients were computed sepa- 
rately for the individual major fields and 
were averaged by using z transformations
and weighting by number of cases.¹

Graduation-nongraduation. This was
the simple dichotomy, graduation vs. nongraduation.


¹ For graduating students, comprehensive
examination grades in the major field were
available for 6 out of 10 of the participating
colleges. However, the validity patterns
were found to be so much like those
for the major fields that data on this criterion
have been omitted from this article.


70 JOHN W. FRENCH


RELIABILITY OF THE CRITERIA 


For some of the colleges the grade 
average, or the major-field grades, or both, 
were computed separately by college year. 
This was done to provide a spot check 
on the estimated alternate-form reliabil- 
ity of these averages. To the extent that 
motivational or other factors change con- 
ditions during the course of the four 
years, the correlations are lowered. There- 
fore, the interyear correlations represent 
underestimations of the alternate-form 
reliability. All of the interyear correla- 
tions are based on graduating students, 
because relatively few of the nongraduates 
had a college record beyond freshman 
year. 

The average of the interyear correla- 
tion figures for average grades (computed 
with z transformations) was .71. Since the 
separate single years can be considered to 
be alternate quarters of the criterion, it 
is appropriate to apply the Spearman- 
Brown formula to estimate the reliability 
of the four-year cumulative average. The 
corrected figure would be about .91. Fur- 
thermore, since the interyear correlations 
were computed for graduating students 
only, it is also reasonable to consider a 
correction for restriction of range in or- 
der to estimate the reliability of the cu- 
mulative average, which is used in this 
report for all students whether or not 
they graduated. The standard deviation 
of the cumulative average for graduating 
students was found to be on the average 
about 25% less than that of the cumula- 
tive average for all students. The correc- 
tion for restriction of range would raise 
the reliability figure still farther. No at- 
tempt will be made to compute the exact 
correction, because corrections from .71 
up to that level are subject to considerable 
distortion. It is clear, however, that the 
reliability of the cumulative average 
compares favorably with that of long, 
well-made aptitude tests. At the same 
time, it is well to remember that the “re- 
liability” in some colleges might be partly 
a result of “halo.” In addition, there are 
other aspects of grades such as prompt- 
ness or neatness which may give them 
high consistency without necessarily re- 
flecting consistent evaluation of achieve- 
ment that is considered important. 
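
Both the validity coefficients and the interyear correlations in this study were averaged with z transformations. A minimal sketch of that procedure follows, assuming Fisher's z (the inverse hyperbolic tangent) and weights equal to the number of cases, as described for the major-field validities. The function name and the sample sizes in the example are hypothetical.

```python
import math


def average_correlation(rs, ns):
    """Average correlations through Fisher's z, weighting by number of cases."""
    if len(rs) != len(ns) or not rs:
        raise ValueError("need matching, non-empty lists")
    zs = [math.atanh(r) for r in rs]            # r -> z
    z_bar = sum(n * z for n, z in zip(ns, zs)) / sum(ns)
    return math.tanh(z_bar)                     # z -> r


# Hypothetical correlations from three groups of unequal size.
pooled = average_correlation([0.65, 0.75, 0.71], [120, 200, 150])
```

Averaging in the z metric rather than averaging the coefficients directly keeps large coefficients from being underweighted, since the sampling distribution of r is skewed near its limits.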

For major-field grades, correlations between
junior-year and senior-year grades
were used. These were found to be .65, 
.75, and .71 for the three areas respec- 
tively. No correction for restriction of 
range is applicable. However, since the 
junior and senior years may be considered 
to be alternate halves of the major-field 
grade criterion, the Spearman-Brown for- 
mula may be used in arriving at estima- 
tions for the reliabilities of the two-year 
major-field average. The corrected figures 
were .79 for science and mathematics, .86
for social science, and .83 for humanities 
and languages. Since the best validities 
reported here or elsewhere do not ap- 
proach the limit made possible by these 
reliabilities, the theoretical best possible 
prediction of college grades is still far 
away. 
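
The Spearman-Brown estimates quoted in this section can be checked directly. The sketch below assumes the standard form of the formula, in which a reliability r for one part becomes nr / (1 + (n - 1)r) for a criterion n times as long; the function name is illustrative.

```python
def spearman_brown(r, n):
    """Reliability of a measure n times as long as one with reliability r."""
    return n * r / (1 + (n - 1) * r)


# Four-year cumulative average from an average interyear correlation of .71.
four_year = spearman_brown(0.71, 4)

# Two-year major-field averages from the junior-senior correlations.
major_field = [round(spearman_brown(r, 2), 2) for r in (0.65, 0.75, 0.71)]
```

With these inputs `four_year` rounds to .91 and `major_field` to .79, .86, and .83, which agrees with the figures reported above.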
Findings With Regard to Average Grades
and Graduation 


For the cumulative average, Table 1 
summarizes for the ten colleges the validi- 
ties of the SAT, High School Record, 
CEEB English Composition Test, and 
the experimental tests. To a considerable 
degree the validities are comparable from 
college to college. The only real exception 
to this is the large size of the validities of 
the experimental tests at College J. Com- 
parisons among the validities of the tests 
are commented upon in a later paragraph 
when results from the several colleges are
pooled.


TABLE 1
SUMMARY TABLE OF VALIDITIES FOR CUMULATIVE AVERAGE

[Most column entries are illegible in this copy. Legible sample sizes: N = 449, 172, 154, and 870. Legible validities: SAT-V, .45; SAT-M, .32; High School Record, .39.]

Variables: SAT-V; SAT-M; High School Record; English Composition Test;
Social Studies Reading; Science Reading; Inductive Reasoning; Integration;
Sufficiency of Data; Data Interpretation; Visualization; Best Arguments;
Perceptual Speed; Carefulness; Picture Memory; Verbal Memory; Number Memory;
Art Information; Literature Information; Social Work Information;
Government Information; Biology Information; Physical Science Information;
Mechanical Information.





Table 2 shows the results pooled for all 
colleges; that is, averages across colleges 
have been computed. This makes conven- 
ient the comparisons among test validities 
for nine criteria: freshman average, cu- 
mulative average, graduation-nongradua- 
tion, and freshman and major-field aver- 
age in each of the three areas. 

Some of the experimental tests, par- 
ticularly when their short length is con- 
sidered, have substantial validities for 
cumulative average. A discussion of com- 
parisons among the test validities, how- 
ever, will be more appropriate in the next 
section where statistical corrections for 
restriction of range and for test length 
are applied. 

Since freshman grades constitute part
of the cumulative average, some similarity
of validity coefficients for these two criteria
is to be expected. However, the extreme
closeness of the figures in the first
two columns of Table 2 shows that the
tests that are valid for freshman grades
are valid to much the same degree for
upperclass grades. There is a very slight
tendency for the cumulative average validities
to be lower than the freshman
validities, but the change is so slight as
to be of no practical importance. The
lack of substantial change in the size of
the validity coefficients suggests that the
factors favoring downward or upward
changes listed earlier in this article either
are not operative or approximately balance
each other. These findings also support
the viewpoint that for use in validity
studies the freshman grade average is a
satisfactory substitute for the four-year
cumulative average.


TABLE 2
VALIDITIES FOR ACADEMIC CRITERIA AVERAGED OVER COLLEGES

[The column entries are illegible in this copy.]

Criteria: freshman average; cumulative average; graduation-nongraduation;
freshman and major-field averages in Science & Mathematics, Social Science,
and Humanities & Languages.

Variables: SAT-V; SAT-M; High School Record; Social Studies Reading;
Science Reading; Inductive Reasoning; Integration; Sufficiency of Data;
Data Interpretation; Visualization; Best Arguments; Perceptual Speed;
Carefulness; Picture Memory; Verbal Memory; Number Memory; Art Information;
Literature Information; Social Work Information; Government Information;
Biology Information; Physical Science Information; Mechanical Information.

The average validities for graduation-
nongraduation are also given in Table 2.
Apparently none of the tests in this study 
have an appreciable relationship to grad- 
uation-nongraduation, and high school 
record has very little. Before attempting 
to interpret this finding, it will help to 
look at the relationship of this criterion 
to grades. 

TABLE 3
UNCORRECTED AND CORRECTED VALIDITIES FOR CUMULATIVE AVERAGE

[The column entries are illegible in this copy.]

Columns: College A₁ (N = 449); College C (N = 870); College I (N = 579);
uncorrected and corrected validities for each.

Variables: SAT-V (90 min.); SAT-M (60 min.); SAT-V (25 min.); SAT-M (25 min.);
Science Reading; Soc. Stud. Reading; Data Interpretation; Sufficiency of Data;
Integration; Best Arguments; Literature Info.; Government Info.


The correlations between graduation-
nongraduation and grades were found for
the 10 colleges to range from .20 to .53.
The weighted average is .44. However,
these correlations are partly accounted
for by an artifact of the situation. A check
of the available data confirms what is,
perhaps, a well known fact that, for those
students who reach senior year, grade
averages are higher in senior year than 
in freshman year. By assuming that the 
students do not work any harder in their 
senior year, it can be argued, then, that 
the grading system changes; high grades 
are easier to get in senior year. Since 
most of the nongraduating students only 
received grades early in their college ca- 
reers, when good grades were most diffi- 
cult to earn, the correlation between grade 
average and graduation-nongraduation 
would almost certainly be higher than the 
actual relationship between graduation- 
nongraduation and scholastic success. One 
measure of this actual relationship is the 
correlation between freshman grades and 
graduation-nongraduation. In this study 
the correlation between graduation-non- 
graduation and cumulative average was 
found to be .46 at College B and .30 at
College J, while the same figures for fresh- 
man average were only .25 and .15, re- 
spectively. These lower figures may be too 
low because of the elapse of time between 
freshman year and the time when many 
students withdraw, but they probably give 
a truer picture of the correlation between 
scholastic success and graduation-non- 
graduation than do the figures for cumu- 
lative average. This correlation is low, 
because so many things other than grades 
can cause a student to withdraw from 
college. 
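
A correlation between a dichotomy such as graduation-nongraduation and a continuous grade average can be computed as a point-biserial coefficient. The article does not state which coefficient was used, so the following is only a sketch under that assumption, with hypothetical data and an illustrative function name.

```python
import math


def point_biserial(flags, scores):
    """Point-biserial correlation between a 0/1 flag and a continuous score."""
    if len(flags) != len(scores):
        raise ValueError("flags and scores must have the same length")
    n = len(scores)
    group1 = [s for f, s in zip(flags, scores) if f]
    group0 = [s for f, s in zip(flags, scores) if not f]
    if not group1 or not group0:
        raise ValueError("both categories must be represented")
    mean_all = sum(scores) / n
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if sd == 0:
        raise ValueError("scores have zero variance")
    p = len(group1) / n  # proportion in the category coded 1
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd * math.sqrt(p * (1 - p))


# Hypothetical data: 1 = graduated, 0 = did not graduate; freshman averages.
graduated = [1, 1, 1, 0, 1, 0, 1, 0]
freshman_avg = [3.1, 2.8, 3.5, 2.4, 2.6, 2.9, 3.0, 2.2]
r_pb = point_biserial(graduated, freshman_avg)
```

Using freshman averages for every student, as in the example, avoids the artifact discussed above, in which nongraduates contribute only early and more severely graded course work.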

The implication of the still lower rela- 
tionship between test scores and gradua- 
tion-nongraduation seems to be that none 
of the colleges participating in this study 
admitted many students whose aptitude 
as measured by tests was so inadequate 
as to lead to either voluntary withdrawal 
or dismissal. This was true even for col- 
leges E, G, and J, where the SAT statis- 
tics indicate that little or no selection oc- 
curred. On the other hand, to the small 
extent that grades do correlate with grad- 
uation-nongraduation, it may be said that 
withdrawal or dismissal occurs when stu- 
dents underachieve, that is, get grades 
which are lower than would be expected 
from their test scores. It is possible, of 
course, that the desire to leave college 
comes first for nonscholastic reasons and 
is followed by a drop in grades. In any 
case, the clear finding is that graduation- 
nongraduation does not serve as a pre- 
dictable criterion against which it is pos- 
sible to validate the kinds of tests tried 
out in this study. 
Application of Statistical Corrections


In order to make appropriate compari- 
sons between the validities of the tests, 
it is necessary to apply corrections for 
restriction in range on the SAT and for 
variations in the lengths of the tests. Un- 
fortunately, it is often misleading to make 
statistical corrections, but it can be even 
more misleading to do without them. For 
this reason, figures for the reliability of 
the criteria have already been given both 
with and without corrections. Figures for 
the validities of the tests were given in 
Table 3 without corrections. Some of these 
validities will now be presented with cor- 
rections. 

Corrections for restriction in range 
compensate for the high degree of selec- 
tivity employed by some of the partici- 
pating colleges. The corrections alter the 
validity coefficients so as to equal the 
values which would have been obtained if 
the range of scores on which the validities 
were observed had been equal to that for 
the entire SAT candidate population on 
the date of the testing in March 1951. At 
this administration the standard deviation 
of SAT-V for the candidate population 
was 113, and that for SAT-M was 110. 
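
The article does not say which correction formula was applied. As a rough sketch only, the function below uses the familiar correction for direct selection on the predictor (often called Thorndike's Case 2), which rescales an observed validity from a restricted group to a group with the candidate-population standard deviation. The function name, the observed validity of .40, and the college standard deviation of 80 are hypothetical; only the population value of 113 for SAT-V comes from the text. Tests other than the SAT are restricted only indirectly through selection on the SAT, so they would call for a different, three-variable formula.

```python
import math

def correct_restriction_case2(r_obs, sd_restricted, sd_population):
    """Estimate the validity in the unrestricted population.

    Assumes selection acted directly on the predictor whose standard
    deviations are supplied (an assumption, not stated in the article).
    """
    k = sd_population / sd_restricted  # ratio of unrestricted to restricted SD
    return (r_obs * k) / math.sqrt(1.0 - r_obs**2 + (r_obs**2) * (k**2))

SD_POP_SAT_V = 113.0   # from the text: SAT-V candidate population, March 1951
r_obs = 0.40           # hypothetical observed validity at one college
sd_college = 80.0      # hypothetical SAT-V standard deviation at that college

r_corrected = correct_restriction_case2(r_obs, sd_college, SD_POP_SAT_V)
print(round(r_corrected, 3))
```

When the college and population standard deviations are equal, the function returns the observed validity unchanged, which is a quick sanity check on the formula.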

Correction for test length was made by 
the Spearman-Brown formula after cor- 
rection for restriction of range had been 
accomplished. The validity coefficients 
were corrected to simulate their value had 
every test been of “practical length,” de- 
fined as 10 minutes for Perceptual Speed 
and 25 minutes for all others. 
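
The text names the Spearman-Brown formula but does not print the form used for validities. A common extension, assumed here, gives the validity of a test lengthened by a factor k from its present validity and reliability. The function names and the numerical inputs (a 10-minute test with validity .30 and reliability .60) are hypothetical and serve only to show the arithmetic.

```python
import math

def spearman_brown_reliability(r_xx, k):
    """Reliability of a test lengthened k times (Spearman-Brown)."""
    return (k * r_xx) / (1.0 + (k - 1.0) * r_xx)

def validity_at_length(r_xy, r_xx, k):
    """Validity of a test lengthened k times, given its present
    validity r_xy and reliability r_xx (assumed form of the correction)."""
    return (r_xy * math.sqrt(k)) / math.sqrt(1.0 + (k - 1.0) * r_xx)

# Hypothetical example: a 10-minute test projected to the 25-minute
# "practical length" mentioned in the text.
k = 25.0 / 10.0
projected_reliability = spearman_brown_reliability(0.60, k)
projected_validity = validity_at_length(0.30, 0.60, k)
print(round(projected_reliability, 3), round(projected_validity, 3))
```

Because the gain depends on the test's reliability, a short and unreliable test gains more from this correction than a short but already reliable one.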

Either to average the validities and 
then apply two kinds of corrections or to 
apply the two corrections and then to 
average the corrected figures seemed to 
be covering up the observed validities 
with too much statistical folderol. There- 
fore, it was chosen not to do any averag- 
ing where corrections were made, but to 
select a few individual college findings 
with which to illustrate the effects of the 
proper corrections. The sample findings to 
be used for this purpose concern one cri- 
terion, the cumulative average, three 
colleges, A,, C, and I, and a selection of 
variables including SAT, ECT, High 
School Record, and four experimental 
tests at each college. The colleges selected 
were the three largest except that College 
H was avoided, because only relatively 
unsuccessful tests were administered there 
(see Table 1). The tests selected for each 
college were those whose average validities 
for all other colleges were the highest. The 
only exception to this rule was a limit of 
two set upon the number of information 
tests selected. This selection technique, 
which was considered to be the equivalent 
of a cross-validation, led to the selection 
of the same two information tests for all 
three colleges: Literature Information and 
Government Information. The other tests 
were necessarily different at each college, 
since none of them was administered to 
more than one of the selected colleges. 

Table 3 gives the selected data from 
Colleges A,, C, and I. For each college 
the first column gives the observed cor- 
relations. The second column gives the 
same correlations after corrections were 
made for restriction in range on SAT-V 
and SAT-M and for test length. 

It is apparent that, after the corrections 
are made on these data, neither the SAT 
nor the High School Record (College I) 
stands supreme as a predictor. The highest 
validity is for Government Information. 
This probably reflects the importance to 
the criterion of width of serious reading 
outside of school requirements. The valid- 
ity may be as high as it is because the 
student who abounds in this kind of in- 
formation would probably possess both 
the aptitude measured by SAT-V and the 
willingness to spend time in serious extra 
study that produces a good high school 
record. While something more complex 
than breadth of information may be the 
desirable outcome of a college education, 
it is, nevertheless, undeniably true that 
the students who can demonstrate a wide 
knowledge of facts are often the ones 
who can think most clearly and are cer- 
tainly the ones who fill up the academic 
honor roll. The runner-up tests, Literature 
Information, SAT-V, English Composi- 
tion Test, and Social Studies Reading, all 
confirm the importance of serious reading 
to college grades. 


Findings With Regard to Grades in Specific Areas 


Table 2 compares the test validities for 
grades in three major-field areas with va- 
lidities for freshman courses in these areas. 
As in the comparison between freshman 
average and cumulative average, there is 
evident here only a very slight drop in 
most validity coefficients between the 
freshman and upper-class years. In these 
tables the freshman and upper-class cri- 
teria do not overlap as they did in the 
case of freshman and cumulative average. 
The major-field grade criteria were averages 
of the appropriate course grades 
earned during the junior and senior years. 
In spite of this lack of overlap, the major- 
field validities have much similarity to 
the freshman validities. Here again the 
findings encourage the practice of using 
freshman grades as criteria of college success. 

It is, perhaps, of interest to note that 
between freshman year and the upper- 
class years the validity of the High School 
Record for major-field grades falls off 
more sharply than do the validities of 
most of the test scores. It seems possible 
that the falling off of validities for High 
School Record in the case of specific 
course areas may be brought about by 
differences between general courses taken 
particularly in freshman year and the 
specialized, major-field courses taken dur- 
ing junior and senior years. The study 
techniques or other methods used to gain 
good grades in high school cannot be very 
different from those required in the more 
general college courses. However, quite 
different techniques may be required for 
specialized major-field work. 

The validity of SAT-V is highest for 
social science; that for SAT-M is highest 
for science and mathematics. For humani- 
ties and languages the SAT-V validity is 
dominant over SAT-M as it is for social 
science, but both SAT validities are lower 
than they were for social science. The 
substantially lower validity of SAT for 
humanities and language grades cannot 
be attributed to low reliability of the cri- 
terion, because as shown in an earlier sec- 
tion, the reliabilities for the criteria in the 
three areas are, respectively, .79, .86, and 
.83. 

The differences in validity for the SAT 
mentioned in the last paragraph are all 
significant at about the 1% level. Differ- 
ences mentioned below in connection with 
the other measures are less significant. 
They are based on fewer cases. Even some 
relatively small differences will be men- 
tioned to draw the reader’s attention to 
differences of interest in judging whether 
further data are likely to reveal significant 
differences which will lead to useful dif- 
ferential prediction among the various 
specialized areas. 

Among the experimental tests some dif- 
ferent patterns of validity coefficients may 
be found for the three areas. Sufficiency 
of Data and Physical Science Information 
appear from these data to be superior to 
SAT-M as specific predictors of science 
and mathematics grades. While SAT-V is 
as good as any of the tests as a specific 
predictor of social science grades, Social 
Studies Reading and Data Interpretation 
are shown by these data to serve about as 
well. Government Information, as might 
be expected from the content of the test, 
has a relatively high correlation with so- 
cial science considering that there has been 
no correction for its very short length, 
but it also has a surprising validity for 
science and mathematics. The only specific 
predictor for humanities and language 
grades is the poor general predictor, 
Perceptual Speed. A very good predictor 
of humanities and language grades, con- 
sidering its short length, is also one that 
might be expected to be suitable for this 
purpose, Literature Information. How- 
ever, this test shows no promise for dif- 
ferential prediction. 


SUMMARY 


The College Board in 1951 initiated a 
validity study of the SAT and a group of 
experimental tests at 10 colleges. This ar- 
ticle compares the validities of these tests 
for average freshman grades with their 
validities for the cumulative four-year 
average and graduation vs. nongradua- 
tion. The validities for freshman grades 
in certain subject-matter areas are com- 
pared with major-field grades in the same 
areas. 

It was found that the pattern of test 
validities for the four-year criteria closely 
resembles that for the freshman criteria. 
These data show the high school record 
to be less good for predicting the quality 
of major-field work than it is for predict- 
ing freshman average grades. Tests of 
government and literature information 
were the most successful among the 
experimental tests. When corrections were 
made for restriction of range and for test 
length, these two tests were actually found 
to be more valid for predicting the cumu- 
lative four-year grade average than was 
the SAT. Neither the SAT nor any of the 
experimental tests had an appreciable 
validity for predicting graduation. 

A NOTE ON PART-WHOLE CORRELATION* 


FREDERICK B. DAVIS 
Hunter College 


When a correlation coefficient is com- 
puted between total scores (such as the 
Performance scores derived from the 
Wechsler Adult Intelligence Scale) and 
part scores (such as the Object Assembly 
scores) that are included in the total 
scores, the resulting coefficient is spuri- 
ously high. There has been some confusion 
in the literature regarding the source and 
amount of this spuriousness. It is the 
purpose of this note to clarify the matter. 

In deviation-score form of the original 
units of measurement, the product- 
moment correlation coefficient between 
scores on total t and on part a, which is 
wholly included in total t, may be written 
as follows: 


$$r_{at} = r_{(a)(a+b+\cdots+n)} = \frac{\sum a\,(a+b+\cdots+n)}{\sqrt{\sum a^{2}}\,\sqrt{\sum (a+b+\cdots+n)^{2}}}$$

Hence, 

$$r_{at} = \frac{s_a + \sum_{j=b}^{n} s_j r_{aj}}{s_t}, \tag{1}$$


where the subscript j denotes any part of 
total t except part a. 

This coefficient is spuriously high be- 
cause the errors of measurement in scores 
on part a are also in scores on total t. 
To obtain a coefficient free from this 
correlation of errors in common, a parallel 
form of part a, to be denoted Part A, 
may be employed. Part A is not, of course, 


* After this note had been accepted for 
publication, the writer's attention was called 
to a paper by Angoff (1) which he had not 
previously seen. Angoff’s equation (3) and its 
derivatives constitute special cases of the 
writer's Equation [3] in the present paper. 

included in total t. Then, 

$$r_{At} = \frac{s_a r_{aA} + \sum_{j=b}^{n} s_j r_{Aj}}{s_t}, \tag{2}$$

where $r_{aA}$ is the reliability coefficient of 
part a. 

Since parts a and A are parallel forms, 
$s_A = s_a$ and $r_{Aj} = r_{aj}$. Therefore, we may 
rewrite Equation [2] as: 

$$r_{At} = \frac{s_a r_{aA} + \sum_{j=b}^{n} s_j r_{aj}}{s_t}. \tag{3}$$

It is obvious from Equation [1] that 

$$\sum_{j=b}^{n} s_j r_{aj} = s_t r_{at} - s_a.$$

Consequently, Equation [3] may be written 
as: 

$$r_{At} = r_{at} + \frac{s_a (r_{aA} - 1)}{s_t}. \tag{4}$$

Either Equation [3] or [4] may be used 
to compute the product-moment coefficient 
of correlation between a total and a part 
included wholly within the total if one 
wishes to report a coefficient free from 
the inflating effect of the correlation of 
errors of measurement common to both. 
Equation [4] will be the more convenient 
if $r_{at}$ is known. 

The difference between the values of 
Equations [1] and [3] or of Equations 
[1] and [4] may be written as: 

$$r_{at} - r_{At} = \frac{s_a (1 - r_{aA})}{s_t}.$$


This difference, it should be noted, is not 
equal to the correlation of the errors of 
measurement in scores on total t and on 
part a. That correlation coefficient, 
expressed in terms of the difference 
$r_{at} - r_{At}$, is: 

$$r_{e_a e_t} = \frac{r_{at} - r_{At}}{\sqrt{1 - r_{aA}}\,\sqrt{1 - r_{tT}}}, \tag{5}$$

where $r_{aA}$ and $r_{tT}$ are the reliability 
coefficients of part a and total t, respectively. 

McNemar (2, p. 164) and other writers 
have referred to the correlation coefficient 
between a part score and the remainder 
of the total score as the part-whole cor- 
relation coefficient corrected for spurious- 
ness. But it is obvious from its very 
definition that such a coefficient is really 
not a part-whole correlation coefficient; 
it is instead a part-remainder correlation 
coefficient, it needs no correction for 
spuriousness, and it may be denoted and 
computed as follows: 


$$r_{(a)(t-a)} = \frac{s_t r_{at} - s_a}{\sqrt{s_t^{2} + s_a^{2} - 2 s_a s_t r_{at}}}. \tag{6}$$








Use of Equations [4], [5], and [6] may 
be illustrated with data pertaining to the 
relationship between Part 11 (Object 
Assembly) and the Performance total score 
(the sum of Parts 7, 8, 9, 10, and 11) of 
the Wechsler Adult Intelligence Scale (3). 
The basic data, reported in terms of 
Wechsler’s Scaled Score units for a group 
of 200 eighteen-nineteen year olds, are 
as follows (when $X_P$ denotes a Scaled 
Score on the Performance total and $X_O$ 
a Scaled Score on the Object Assembly 
part): 

$$\begin{aligned}
\bar{X}_O &= 10.00 & s_O &= 2.79 & r_{OP} &= .82 \\
\bar{X}_P &= 49.43 & s_P &= 11.83 & r_{oO} &= .65 \\
 & & & & r_{pP} &= .93
\end{aligned}$$

Here $r_{oO}$ and $r_{pP}$ are the reliability 
coefficients of Part 11 and of the 
Performance total, respectively. 

The correlation coefficient of .82 between 
scores on Part 11 and the Performance 
total ($r_{OP}$) was computed directly 
from the data. Equation [4] yields 
a value of .74 for the correlation between 
scores on Part 11 and the Performance 
total free from the spurious inflation 
owing to the perfect correlation of errors 
of measurement common to both scores. 
Equation [6] yields a value of .71 for the 
correlation between scores on Part 11 
and the sum of the remaining parts of the 
Performance total. Equation [5] yields a 
value of .52 for the correlation between 
errors of measurement in the entire 
Performance total and in Part 11 alone. 
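
The arithmetic of this illustration can be retraced with a short sketch. The three functions below implement Equations [4], [5], and [6] as they follow from the derivation; the function and variable names are illustrative and are not taken from the note. With unrounded intermediate values the sketch gives about .74 for Equation [4], about .71 for Equation [6], and about .53 for Equation [5]. The note reports .52 for the last, which would be consistent with carrying rounded intermediate values such as .82 − .74 = .08, though that explanation is only a guess.

```python
import math

def part_whole_corrected(r_at, s_a, s_t, rel_a):
    """Equation [4]: part-whole correlation free from common errors."""
    return r_at + s_a * (rel_a - 1.0) / s_t

def error_correlation(r_at, r_At, rel_a, rel_t):
    """Equation [5]: correlation of errors of measurement in part and total."""
    return (r_at - r_At) / (math.sqrt(1.0 - rel_a) * math.sqrt(1.0 - rel_t))

def part_remainder(r_at, s_a, s_t):
    """Equation [6]: correlation of the part with the rest of the total."""
    return (s_t * r_at - s_a) / math.sqrt(s_t**2 + s_a**2 - 2.0 * s_a * s_t * r_at)

# Data quoted in the text (Object Assembly vs. Performance total).
s_o, s_p = 2.79, 11.83
r_op, rel_o, rel_p = 0.82, 0.65, 0.93

r_corrected = part_whole_corrected(r_op, s_o, s_p, rel_o)
r_errors = error_correlation(r_op, r_corrected, rel_o, rel_p)
r_remainder = part_remainder(r_op, s_o, s_p)
print(round(r_corrected, 2), round(r_remainder, 2), round(r_errors, 3))
```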

As would be expected, the coefficients 
yielded by Equations [1], [3] or [4], and 
[6] range themselves in order of decreasing 
magnitude. The coefficient of .82 indicates 
the actual relationship of two 
partially overlapping variables—scores on 
Part 11 and the Performance total in the 
sample of 18-19 year olds. On the other 
hand, the coefficient of .74 indicates the 
relationship of two entirely separate vari- 
ables that measure the same abilities (plus 
chance) as Part 11 and the Performance 
total in the same sample of 18-19 year 
olds. This is a part-whole coefficient prop- 
erly corrected for spuriousness owing to 
the correlation of errors in common. The 
coefficient of .71 indicates the actual re- 
lationship of scores on Part 11 and the 
sums of scores on other parts included 
in the Performance total. This is a part- 
remainder coefficient. 

Of the three coefficients, the one having 
the value of .74 is most meaningful for 
comparison with the great majority of 
intercorrelations reported among mental 
tests. This is because such intercorrelations 
are ordinarily based on separate tests and 
are not inflated by correlation of errors 
of measurement in common. The co- 
efficient of .82 is of fundamental utility in 
computing variances, standard errors, etc. 
The meaning of the coefficient of .71 is 
clear, but this type of coefficient is not 
commonly of practical utility. Each of 
these coefficients has its own particular 
merit and the distinctions among them 
should be recognized so that one will not 
be confused with another. 

REFERENCES 

1. Angoff, W. H. A note on the estimation 
of nonspurious correlations. Psychometrika, 
1956, 21, 295-297. 
2. McNemar, Q. Psychological statistics. 
New York: Wiley, 1955. 
3. Wechsler, D. Manual for the Wechsler 
Adult Intelligence Scale. New York: 
Psychological Corp., 1955, Tables 6, 7, 
and 10. 

Received November 14, 1967. 

THE INFLUENCE OF CONSISTENT AND INCONSISTENT 
GUIDANCE ON HUMAN LEARNING AND TRANSFER¹ 

In 1928, Goodenough (7) reported as 
a finding of her study on anger in young 
children an apparent relationship between 
inconsistency of parental discipline and 
frequency of anger outbursts. Other stud- 
ies on the effects of consistency and in- 
consistency followed. These dealt with 
various ways in which consistency or in- 
consistency is expressed, as for instance in 
parental demands, commands, etc. (1, 2, 
4, 5, 8, 12). The results of these studies 
all suggested the conclusion that incon- 
sistency in the behavior of an authority 
figure toward a child has disturbing effects 
both on the child’s immediate behavior 
and on his subsequent personality de- 
velopment. Support for this conclusion 
came from animal studies in which random 
reinforcement was a variable (11, 15, 
16). The last study on the effects of these 
variables appeared in 1952 (8), and our 
textbooks speak of the detrimental effects 
of inconsistency as established fact (e.g., 
3, 6, 9, 10, 13). 

The study here reported arose, however, 
as a result of an impression that the 
work done on the problem does not justify 
the conviction shown with regard to the 
detrimental effects of inconsistency. For, 
while the data strongly support the con- 
tention, e.g., that parental inconsistency 
has damaging effects on a child’s behavior, 


This paper is a condensation of the au- 
thor’s doctoral dissertation, completed in 
1956 at the University of Florida under the 
direction of Rolland H. Waters, who was also 
kind enough to read and make valuable sug- 
gestions about the present paper. The writer 
is also grateful to Henry S. Curtis, Morton 
S. Slobin, and Stanley Spiegel for their help 
with this paper. 

* Now at the VA Regional Office, Cleve- 
land, Ohio. 


the nature of the studies gives reason to 
question the validity of the data. The 
direct data on inconsistency come from 
case history and observational studies, 
studies too loosely designed to control 
for the possibility that it is the person 
being inconsistent rather than the in- 
consistency itself which is causing the 
damage. Baldwin, et al. (1), for instance, 
found that inconsistency figured in the 
rejecting parent’s behavior in that discipline, 
decisions, etc., were based on the 
parent’s convenience. This finding would 
suggest that inconsistency is one avenue 
through which rejection is expressed, but 
for which the inconsistency could be ir- 
relevant. The studies involving random 
reinforcement seem better controlled, but 
only one (15) includes a study of im- 
portant transfer effects, and in every 
case we have no way of knowing how far 
we can generalize from infrahuman to 
human Ss. In brief, it appears that we 
actually do not know that inconsistency 
itself has a detrimental effect on behavior. 

The intent of this study, then, was to 
isolate and study the variables of con- 
sistency and inconsistency in a controlled 
laboratory setting. The study was not 
designed to investigate the effects of pa- 
rental consistency or inconsistency. Al- 
though primary interest has been in the 
effects of parent-child inconsistency, it was 
considered important to test the specific 
effects of these variables apart from other 
conditions. An attempt was made simply 
to answer the following question: If Ss 
are given consistent or inconsistent guid- 
ance while learning to solve a maze prob- 
lem, in what ways will their learning be- 
havior be affected both in the immediate 
learning situation and later when they are 
no longer being guided and are confronted 
with a similar but different problem to 
solve? 

PROCEDURE 

Eighty-eight college students, male and 
female, under 26 years old, and essen- 
tially inexperienced with maze problems, 
served as Ss. As each S appeared at the 
experimental room, he or she was assigned 
randomly to one of three groups, known 
as Groups I, II, and C. The Ss were 
seated before a shield which obscured the 
apparatus and the experimenter. They 
were given a general description of the 
type of maze they were to learn, and told 
that their purpose was to learn to guide 
a stylus from start to goal without error. 
They were told further that when they 
reached the goal at the end of each run 
both the red and the green light suspended 
before them would flash to signal the end 
of a trial. In addition to this general 
orientation, Ss in Groups I and II were 
told that when they made a correct turn 
the green light would flash, and when 
they made a wrong turn the red light 
would flash. 

The stylus was placed in the S's hand 
and guided to the starting point of a 
standard 10-turn Warden U-type maze 
(14) employed, and the S was told to 
start. Light cues were given Group I Ss 
consistently according to instructions. Un- 
known to Group II Ss, however, the light 
cues given them were wrong at three of 
the ten choice points on each trial. Also, 
the choice points at which wrong cues were 
given were varied from trial to trial ac- 
cording to a prearranged pattern. Group 
C Ss were given no guidance and served 
as the control group. 

Whether or not they reached the cri- 
terion of one errorless run, all Ss were 
required to run 15 trials, after which all 
were stopped and transferred to the lateral 
reverse of the practice maze pattern. 
Here they were told that their task 
and purpose were the same, but that the 
only light cues they would see would be 
those at the end of each run. All Ss were 
then allowed to run until they reached 
the criterion of one errorless trial. Records 
were kept of errors and time per trial, and 
of trials to criterion, and notes were taken 
of spontaneous behavior exhibited. Fol- 
lowing completion of the second maze 
problem, Ss were interviewed with regard 
to their impressions of the experimental 
experience. 


RESULTS 


The quantitative results are summarized 
in Tables 1 and 2. It will be noted that no 
figures are given for trials to criterion 
on the practice maze. The reason for this 
is that since only 11 Group I Ss and eight 
Group C Ss reached criterion in 15 trials, 
it was not possible to compute mean trials 
to criterion. Instead, the percentages of 
Ss in each group who reached criterion 
were computed, and these percentages 
were compared using the chi-square 
method. 

For the inconsistently guided Group II, 
mean total errors and time were significantly 
greater than those for Groups I 
and C, both on practice and transfer 
mazes. A significantly smaller proportion 
of Group II Ss reached the criterion 
within 15 trials on the practice maze 
(chi square 14.05, significant beyond the 
.01 level). Group II Ss required significantly 
more trials to reach criterion on 
the transfer maze than did Groups I and 
C. Group C practice maze time was sig- 
nificantly greater than that of Group I, 
but otherwise Group I performed only 
slightly and insignificantly better than did 
Group C. 
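
The chi-square comparison of proportions reaching criterion can be sketched as a test of independence on a groups-by-outcome table. The article reports only that 11 Group I Ss and eight Group C Ss reached criterion; the group sizes and the Group II count used below are hypothetical placeholders, so the sketch shows the method and is not expected to reproduce the reported value of 14.05. The function name is likewise illustrative.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: Groups I, II, C. Columns: reached criterion, did not reach it.
# Only the 11 and the 8 come from the text; every other count is a
# hypothetical placeholder chosen so that the rows sum to 88 Ss.
hypothetical_table = [
    [11, 19],  # Group I
    [1, 28],   # Group II (count reaching criterion is assumed)
    [8, 21],   # Group C
]
statistic, dof = chi_square_independence(hypothetical_table)
print(round(statistic, 2), dof)
```

With three groups and two outcomes the statistic has two degrees of freedom; the actual group counts would have to be substituted before comparing the result with the reported figure.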

While variances did not differ signifi- 
cantly for the practice maze, they did for 
the transfer maze. Group II variances 
were significantly greater than those of 
Group I for all measures, and greater 
than those of Group C for time and trials 
to criterion. Also, Group C variances were 
significantly greater than those of Group 
I for errors and trials to criterion. 

The behavioral data characterized 
Group I Ss as initially dependent upon 
the light cues but as gradually showing less 
dependence upon them. Group II Ss were 
more characteristically confused by the 
lights at first and then reacted to them 
in one of three ways: either they rebelled 
against instructions and ignored the lights, 
they were confused and ambivalent about 
them, or they followed them passively. 
Group II Ss tended also to be uneasy 
about verbalizing doubts concerning the 
accuracy and usefulness of the light cues; 
those bold enough to rebel against the use 
of the cues were quite outspoken, but at 
the other extreme those who followed 
the cues passively distorted their per- 
ceptions of the situation so far as to in- 
sist that the cues were helpful. Further, 
transfer maze performances for Group II 
were related to the degree to which Ss 
had ignored the light cues on the practice 
maze, i.e., those who ignored the lights 
tended to do as well as the best in Groups 
I and C, etc. Group C Ss approached the 
mazes in a matter-of-fact, business-like 
manner. 


DISCUSSION 


The excessive variance of Group II 
transfer maze performance, together with 
the protocol material, makes an interpretation 
of the effect of the inconsistent 
guidance difficult. On the one hand, the 
group performances suggest that to a 
significant degree Group II was adversely 
affected by the inconsistent guidance. On 
the other hand, however, the magnitude 
of Group II variance cautions against 
drawing such a broad conclusion, because 
the inconsistency can hardly be said to 
have had a very uniform effect on Group 
II Ss. 

The results seem to become understand- 
able when Group II variance and be- 
havioral data are considered in detail. 
First, the possibility of a sampling bias 
contributing to the variability can be 
discarded on the grounds that the groups 
showed similar variability on the practice 
maze. Can it be concluded then that the 
inconsistency itself produced the variabil- 
ity, or did the inconsistency bring into 
play personal variables which determined 
individual performances? 

The behavioral data suggest the latter 
of the two possibilities. It appears that 
the inconsistency provoked three grossly 
different personal reactions, i.e., a defiant 
and rebellious one, a confused and 
ambivalent one, and a passive one. It ap- 
pears further that the particular personal 
reaction provoked was related to trans- 
fer maze performance. It is true that the 
inconsistent cues had the initial effect of 
confusing all the Group II Ss (perhaps 
accounting for the more similar variance 
of practice maze performance), but this 
effect did not last for those Ss who were 
able to break away from the light cues 
and to attend to cues from the maze 
itself. Lasting confusion and damage to 
performance seemed to occur primarily 
when Ss could not break away from the 
inconsistent guidance. These Ss emerged 
from the practice maze experience with 
little useful information to apply in deal- 
ing with the transfer maze. It would seem 
necessary to conclude, then, that the 
inconsistent guidance had the immediate 
effect of confusing the recipient, but that 
its effects were temporary unless the re- 
cipient was unable to rebel against the 
inconsistent guidance. 

Some clues are present which suggest 
an explanation for this behavior of Group 
II Ss. It appears that the more ambivalent 
and passive Ss were those who seemed to 
fear offending the experimenter and/or 
being embarrassed by questioning the cues. 
For personal reasons these people seemed 
to feel uncertain enough in the relation- 
ships with themselves and/or with the 
experimenter to feel that it was important 
not to question the experimenter too 
seriously if at all—the safest reaction be- 
ing complete passivity. 

Finally, a few words might be said 
about the variance differences between 
Groups I and C, where actual perform- 
ances did not differ significantly. It is 
felt that the guidance given Group I en- 
couraged group conformity of perform- 
ance, while no guidance perhaps allowed 
Group C Ss to develop whatever poten- 
tials they had. 


IMPLICATIONS 


If the results of this study are de- 
pendable, they raise important questions 
about the origins of behavior pathology. 
Broadly speaking, the results make it diffi- 
cult to maintain the position that a par- 
ticular type of experience will affect per- 
sonality in a particular way. We are 
confronted again by that constant source 
of irritation, the intervening variable. In 
this particular instance, the effect that 
the “experience” had was apparently in- 
fluenced by how the S perceived the sit- 
uation, and that perception seemed in 
turn influenced by how secure the S felt 
in relation to himself and/or to the ex- 
perimenter. What apparently was im- 
portant here was whether the S perceived 
the situation as one in which he could 
comfortably question the misguiding 
information he was receiving. 

It would seem important, then, in 
understanding the origins of behavior 
disturbance, to study some of the intervening 
variables which could play a role in de- 
termining the effect that a particular ex- 
perience might have on the developing 
personality. With regard to the results 
of the present study, it would seem im- 
portant to know more about the variables 
which influence a person to perceive a 
situation as one in which he could or could 
not comfortably question inconsistent 
guidance being given him by an authority 
figure. In the final analysis, a study of 
such intervening variables may reveal that 
what a parent actually does or does not 
do with regard to his child is not nearly 
so important for the developing person- 
ality as is, for instance, the interpersonal 
relationship in which this act occurs. 


SUMMARY AND CONCLUSIONS 


This study was designed to answer the 
question: If Ss are given consistent or 
inconsistent guidance while learning an 
initial maze problem, in what ways will 
their learning behavior be affected both 
in the immediate and in a transfer situa- 
tion? On a 10-turn Warden U-type maze, 
Ss were given either consistent, incon- 
sistent, or no guidance. After 15 trials 
under one of these conditions, all Ss were 
transferred to the lateral reverse of the 
initial maze where all were required to 
run without guidance until one errorless 
run was achieved. After learning the transfer 
maze, Ss were interviewed for impressions 
of the experiment. The results 
suggested the following statements in an- 
swer to the motivating question: 

1. The influence of consistent guidance 
is not markedly different from that of no 
guidance. 

2. While inconsistent guidance is being 
given it has a confusing and generally 
detrimental influence on learning as compared 
with the influence of consistent or 
no guidance. 

3. Inconsistent guidance does not nec- 
essarily have lasting damaging influence 
on learning behavior. 

4. Lasting damage to learning behavior 
results from inconsistent guidance when 
the recipient of the guidance is for some 
reason unable to rebel and ignore the 
guidance. 

One of the most difficult problems that 
must be solved before useful results can 
come from research into the relationship 
between teacher personality and pupil 
growth is that of securing objective meas- 
ures of the teacher’s personality as it 
functions in the classroom. The usual ap- 
proach to this problem has been to use 
ratings by supervisors or specially trained 
observers, but, despite all attempts to 
improve them, such ratings are still biased, 
subjective, and in many cases uninterpret- 
able by anyone, even the rater himself. 

Whatever value such ratings have arises 
from the fact that they are based on ob- 
servations of the teacher while he is 
teaching; their most serious limitations 
arise from the fact that the evaluative 
judgment of the rater intervenes between 
the behavior and the score supposed to 
reflect it. There are at least two sources 
of variation introduced here that attenuate 
the validity of the ratings by distorting 
measured differences between teachers. 
The cues upon which the observer bases 
his judgment and the relative weights as- 
signed to them are both allowed to vary 
from observer to observer to some un- 
known degree. By providing a schedule 
for recording behaviors listing the cues to 
be responded to, the first source of error 
may be virtually eliminated. By making 
the assignment of weights a clerical task 
done by someone other than the observer, 
the second may also be made negligible. 

As a part of a longitudinal study of 
graduates of the Teacher Education pro- 
gram of the municipal colleges of New 
York City (City, Hunter, Brooklyn, and 
Queens) carried out in the Office of Re- 
search and Evaluation of the Division of 
Teacher Education, a technique for
objectively observing and recording
classroom behaviors was developed. The
Observation Schedule and Record (OScAR)
was constructed by modifying and com- 
bining the methods proposed by Cornell 
(1) and Withall (4) on the basis of the 
results of tryouts of the two techniques. 
Three basic changes were made. 

Inspection of the reliabilities of the 
scales prepared by Withall and Cornell
showed that some of them suffered from a 
lack of observer agreement to a degree 
that seriously impaired their accuracy (2). 
Accordingly, the first change was designed 
to increase observer accuracy. If an ob- 
servational technique is such that it takes 
a highly trained observer to use it suc- 
cessfully, it has limited usefulness, and 
results of future measurements may be 
suspect because the observers may be in- 
adequately trained. For this reason, the 
scales of both Cornell and Withall were 
redefined in somewhat simpler terms for 
use in the OScAR in order to minimize the 
amount of training necessary for its use. 

Experience with these two techniques 
also showed that the often-adopted prac- 
tice of sending several observers into the 
classroom together (presumably so that 
one observer can record what another 
misses) is uneconomical. A score based on 
observations made by two observers who 
see a teacher at different times is actually 
more reliable than one based on observa- 
tions made by two observers who see the 
teacher at the same time; and it seems 
intuitively obvious that the former score 
is more valid as well, since the behavior 
sample obtained is twice as great. The 
OScAR was therefore designed to be used 
by a single observer visiting a classroom 
by himself. 

The third change involved was the
separation of the process of scoring from the
process of observing teacher behaviors.
The OScAR was designed to permit the 
recording of as many aspects of what goes 
on in a classroom as possible, regardless 
of their relationship to any dimension or 
scale. The observer’s sole concern was to 
see and hear as much of what was going 
on as he could, and to record as much of it 
as the structure of the OScAR permits, 
without any attempt to evaluate what he 
saw. 


DESCRIPTION OF THE OScAR TECHNIQUE 


The OScAR technique is both a method 
of observing and a method of recording 
classroom behavior; in the interests of 
simplicity the two aspects will be de- 
scribed simultaneously. 

The observer making a visit to a class- 
room arrives at—or near—a prescheduled 
time, so it is usually not necessary for 
him to greet the teacher or class when he 
arrives. Instead, he tries to enter and take
a seat at the back of the room as unob- 
trusively as possible. He first notes the 
time and the number of pupils present in 
the spaces at the upper left corner of the 
“front” of a specially printed 5 x 8 card 
(see Fig. 1). Then he starts his stopwatch 
and begins to record behaviors on the 
front of the card by checking as many of 
the items in the Activity Section as de- 
scribe what he sees. 

The Activity Section consists of 44 ac- 
tivities likely to be observed in a class- 
room, such as “teacher works with individ- 
ual pupil,” “pupil writes or manipulates 
at his seat,” “pupil laughs.” Varying
numbers of the Activity items may be checked,
according to how many different kinds of
activities are going on at one time.

* Tables A through G and Figures 1 and 2
have been deposited with the American
Documentation Institute. Order Document
No. 5556, remitting $1.75 for 35 mm.
microfilm or $2.50 for 6 by 8 in. photocopies.
Typescript copies of a more detailed version
of this paper containing all tables will be
furnished on request to the authors while
the supply lasts.

The observer then concentrates on the 
Grouping Section. The Grouping Section 
lists four sizes of groups from “at least 
half of class in group with teacher” and 
“at least half of class in group without 
teacher” to “pupil as individual.” In 
Column I he checks each type of admin- 
istrative group (i.e. group apparently set 
up by the teacher) that he can detect in 
the class and each type of social group he 
observes—a social group being defined as 
one in which there is pupil-pupil or pupil- 
teacher interaction. 

Next the observer checks the type of in- 
structional materials being used, in the 
Materials Section, which lists various 
learning aids and materials such as black- 
board, audio aid, text or workbook. All 
through this initial period, the observer 
keeps alert for any type of activity, group- 
ing, or material not already checked, and 
checks the appropriate item for each one 
as it occurs. No item on this side of the 
card is checked more than once during this 
time, however. Items in the Signs Section 
(which consists of items considered symp- 
tomatic of classroom climate, like “teacher 
shows affection for pupil” and “pupil 
moves freely”) are marked with a plus 
sign if and when they are observed. At 
the end of five minutes the observer 
briefly considers each item in this section 
not already marked, and marks it either 
plus or zero. 

As soon as he has done this, the observer 
stops his watch and turns the card over 
(see Fig. 2). In the Subject Section, which
lists the 10 most common subject areas, 
he checks in Column I whichever of the 
10 areas of instructional activities has re- 
ceived most attention during the five 
minutes just ended. 

The observer then starts his stopwatch 
again and begins to tally each statement 
the teacher makes in one of five
categories: Pupil-Supportive, Problem-Structuring,
Miscellaneous, Directive, Reproving.
He makes a tally in Column II of the
Expressive Behavior Section in the line 
corresponding to the category in which 
each statement is classified. 

At the same time, he watches for 
changes of expression on the teacher's 
face, such as smiles, frowns, and scowls, 
and for expressive gestures such as nods, 
threatening glances, and body movements.
Each time he observes a look or gesture
which he judges to express approval of or 
affection for a pupil, the observer makes 
a tally in Column II after Item K1; each 
time he observes a look or gesture which 
he judges to be hostile or reproving, he 
makes a tally after K7. 

This continues for a second period of 
five minutes. At the end the observer stops 
his watch again and fills out Column II 
in the Subject Section just as he filled out 
Column I at the end of the first five- 
minute period. He then turns the card 
over, starts his stopwatch again, and
proceeds as in the first period for five minutes
more, except that he uses Column III
rather than Column I. This alternation of 
sides of the card is continued until six 
five-minute periods of observations are 
completed. 
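
The alternation of card sides described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `observation_schedule` is hypothetical, and the column numerals for the fourth through sixth periods (IV, V, VI) are extrapolated from the pattern of Columns I, II, and III, since the text names only the first three.

```python
# Column numerals by period; IV-VI are assumed from the I, II, III pattern.
ROMAN = ["I", "II", "III", "IV", "V", "VI"]


def observation_schedule(n_periods=6, minutes_per_period=5):
    """Return one entry per timed observation period of a single visit."""
    schedule = []
    for index in range(n_periods):
        # Odd-numbered periods use the front of the card, even-numbered the back.
        side = "front" if index % 2 == 0 else "back"
        schedule.append(
            {
                "period": index + 1,
                "side": side,
                "column": ROMAN[index],
                "minutes": minutes_per_period,
            }
        )
    return schedule


visit = observation_schedule()
total_minutes = sum(period["minutes"] for period in visit)
```

Under these assumptions a visit comprises three periods on each side of the card and thirty minutes of timed observation in all.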


COLLECTION OF DATA


The observations which form the pri- 
mary data of this study were made with 
OScAR in the classrooms of 49 beginning 
teachers in public elementary schools in 
New York City over a period of approxi- 
mately 10 weeks. Of the 49 teachers, 46 
were female, 3 male. The teachers were 
scattered among 19 schools in four bor- 
oughs, the number of teachers in a single 
school ranging from two to five. Twenty- 
three of the teachers taught Grade 3, 
thirteen Grade 4, nine Grade 5, and four 
Grade 6. 

Observers worked in pairs, two ob- 
servers visiting a school together. In most 
cases, all of the teachers in a school were
seen by both observers in a pair on the
same day, although in no case did two
observers visit the same teacher at the 
same time. No attempt was made to con- 
trol the type of activity observed; all that 
was asked was that the teacher and the 
class be present in the classroom. 

A number of minor shortcomings in the 
original OScAR (2) having been noticed, 
it was revised to the form described in 
this report. The new form was adopted at 
the beginning of the second round of visits. 
The pairs of observers who went to the 
schools together were reshuffled somewhat, 
and a new schedule in which each observer 
was to see each teacher once again was set 
up. The first visits were made on January 
24, 1955, and the last on Tuesday, April 
5, 1955. 


ANALYSIS OF THE OBSERVATIONAL 
RECORDS


The analysis of the data followed four 
steps. First, a preliminary study was made 
of each item to find out whether there 
were reliable differences in the number of 
times the behavior was observed in the 
classrooms of different teachers. Next, the 
items were combined into 14 “keys,” which 
were scored. Third, a factor analysis of 
scores on these 14 keys was made; and 
finally, the keys were combined into three 
factor dimensions. 

The results of the analysis of individual 
items are given in Tables A through F.
Except in the case of a few items that 
were highly reliable by themselves, those 
items that discriminated well were com- 
bined into provisional keys on the basis 
of a priori judgment that they belonged 
together. 

For example, the following three items 
from the Activity Section: 

E1. pupil talks to a group
E5. pupil demonstrates or illustrates
E10. pupil leads the class

were combined into a single key called
“Pupil Leadership Activities.” (The
composition of each of the 14 keys found to
discriminate is given in Table G.) The
reliability of each key was estimated from
a three-way analysis of variance under
mixed-model assumptions, with teachers and
visits regarded as random effects
and items as a fixed effect.


TABLE 2
LOADINGS OF FOURTEEN OScAR SCORING KEYS ON THREE ORTHOGONAL FACTORS

(1) Time spent on reading
(2) ... statements
(3) Autonomous groupings
(4) Pupil leadership activities
(5) Freedom of movement
(6) Manifest teacher hostility
(7) Supportive teacher behavior
(8) Time spent on social studies
(9) Disorderly pupil behavior
(10) Verbal activities
(11) Traditional pupil activities
(12) Teacher’s verbal output
(13) Audio-visual materials
(14) Autonomous social groupings


TABLE 3
INTERCORRELATIONS AMONG THREE FACTOR SCALES BASED ON OScARs OF 49 BEGINNING TEACHERS
(Reliabilities in the Diagonal)

                     Emotional   Verbal     Social
                     Climate     Emphasis   Structure
Emotional Climate    (.903)      -.004      -.110
Verbal Emphasis                  (.770)     +.028
Social Structure                            (.826)

The coefficient of reliability so obtained 
is a maximum likelihood estimate of the 
expected correlation between the mean of 
all the scores assigned to the teachers by 
the six observers on the basis of the twelve 
visits made, and means of scores that 
would be assigned to the same teachers by 
six different observers visiting their
classrooms at twelve other times. Errors arising
from three potentially important sources
are taken into account: errors resulting 
from fluctuations in teacher and pupil be- 
haviors during several weeks, errors re- 
sulting from differences in ways in which 
various observers would tally identical 
behaviors, and errors resulting from the 
failure of an observer to note and record 
all that happens during a five-minute 
period. 
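
The idea of a reliability coefficient for mean scores can be illustrated with a much simpler design than the one used here. The sketch below is not the three-way mixed-model analysis described above; it assumes a one-way layout (teachers by visits, equal numbers of visits) and estimates the reliability of each teacher's mean as the ratio (MS between - MS within) / MS between. The function name `reliability_of_means` and the toy scores are hypothetical.

```python
def reliability_of_means(scores):
    """Reliability of per-teacher means from a one-way analysis of variance.

    scores: list of rows, one row per teacher, one score per visit
            (every row must have the same length, at least 2).
    """
    n_teachers = len(scores)
    n_visits = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_teachers * n_visits)
    teacher_means = [sum(row) / n_visits for row in scores]

    # Mean square between teachers.
    ms_between = (
        n_visits
        * sum((m - grand_mean) ** 2 for m in teacher_means)
        / (n_teachers - 1)
    )
    # Mean square within teachers (visit-to-visit fluctuation).
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, teacher_means) for x in row
    ) / (n_teachers * (n_visits - 1))

    return (ms_between - ms_within) / ms_between


# Toy data: three teachers, three visits each (illustrative only).
toy_scores = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
toy_reliability = reliability_of_means(toy_scores)
```

In this simplified form, the more the teachers differ from one another relative to the fluctuation within a teacher's own visits, the closer the coefficient comes to 1.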

Table 1 shows the reliabilities of all 14 
scoring keys and their intercorrelations. 
The sizes of the reliability coefficients in- 
dicate that these teachers’ classes differed 
widely with respect to what was going on 
in them. 

The intercorrelations among the 14 di- 
mensions suggest that the differences 
might be described in terms of fewer than 
14 variables, so a centroid factor analysis 
was made and three factors extracted. 
The centroid factor matrix was rotated 
orthogonally twice according to the pro- 
cedure proposed by Reyburn and Taylor. 
Table 2 shows the loadings of the original 
keys on the three factors after rotation. 
The factors were named Emotional
Climate, Verbal Emphasis, and Social
Structure.

Three scales were constructed by com- 
bining, with equal weights and the signs 
indicated by the loadings, the scores on 
those keys most highly loaded on each 
factor. Table 3 shows the reliabilities of 
these three scales and their intercorrela- 
tions. The three scales are practically in- 
dependent of one another (as would be 
expected) and are highly reliable.
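
A scale formed with equal weights and signs taken from the loadings might be computed as in the following sketch. Several details are assumptions made for illustration and are not taken from the study: the key scores are standardized before summing, the names `standardize` and `factor_scale` are hypothetical, and the two keys and their signs in the example are invented rather than read from Table 2.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw key scores to z-scores (assumes the scores are not all equal)."""
    center, spread = mean(values), pstdev(values)
    return [(value - center) / spread for value in values]


def factor_scale(key_scores, signs):
    """Sum standardized key scores, each multiplied by its sign (+1 or -1).

    key_scores: {key name: list of scores, one per teacher}
    signs:      {key name: +1 or -1} for the keys that enter the scale
    """
    z_scores = {name: standardize(key_scores[name]) for name in signs}
    n_teachers = len(next(iter(z_scores.values())))
    return [
        sum(signs[name] * z_scores[name][i] for name in signs)
        for i in range(n_teachers)
    ]


# Hypothetical example: two keys, three teachers.
example_keys = {"hostility": [1, 2, 3], "support": [3, 2, 1]}
example_signs = {"hostility": -1, "support": +1}
example_scale = factor_scale(example_keys, example_signs)
```

Equal weighting of this kind ignores the exact size of each loading and uses only its direction, which keeps the scoring a clerical task.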


DISCUSSION


The effort made in this study to secure 
quantitative, objective information about 
happenings in ordinary classrooms and 
typical learning situations was not in- 
tended to imply that ratings by super- 
visors and other qualified observers may 
not serve a useful purpose. It arose from 
the conviction that there are purposes 
such ratings cannot serve. One such pur- 
pose is research into the nature of teacher 
effectiveness—research seeking to answer 
questions about how teachers influence 
pupil learning. 

Information that effective teachers are 
warm and friendly, or firm but fair, or 
that they explain things clearly, is useful 
in this sense only if these terms are oper- 
ationally defined. If such operational defi- 
nitions must be phrased in terms of expert 
judgment, they can tell us only about ex- 
pert judgment. Whatever inferences re- 
search with a technique such as OScAR 
justifies will tell educators what a teacher 
should do in specific terms—not what 
someone's reaction to his behavior ought
to be.

A study of the factorial structure of the
14 scoring keys indicates that the OScAR
technique gives reliable information about
three relatively discrete dimensions of 
classroom behavior—the social-emotional 
climate, the relative emphasis on verbal 
learnings, and the degree to which the 
social structure centers about the teacher. 
Certainly there must be many other
important differences in ways teachers and
pupils behave that are not included in this 
list. It is important that such differences 
be identified and techniques developed for 
observing them. 

The potential importance of the kind of 
objective data about classroom behavior 
that can be obtained in this way is very 
great. Practical problems such as how to 
select students likely to become successful 
teachers, how to screen out those who can- 
not get along with children, and what 
ought to be the content of teacher training, 
can be solved in no other way than by 
studying teachers’ classroom behavior. 


SUMMARY AND CONCLUSIONS


The OScAR was developed as a device 
for securing a record of behaviors of teach- 
ers and pupils observed by a classroom 
visitor. It was used in a series of 588 half- 
hour visits made by six observers visiting 
49 teachers twice each. Items which on the 
basis of content appeared to belong to- 
gether were grouped into 14 keys which 
were found to have reliabilities of at least 
.60. A factor analysis identified three
orthogonal factors accounting for most of
the observed differences. 

The three aspects in which the behav- 
iors observed in the 49 classrooms differed 
were: Emotional Climate, having to do 
with the relative amount of hostility ob- 
served; Verbal Emphasis, having to do 
with relative emphasis on verbal and tra- 
ditional schoolroom activities; and Social 
Structure, having to do with the relative 
degree of pupil-initiated activity. These 
three aspects were found to be orthogonal 
—a hostile class was no more likely to be 
verbal, or to have a restricted social or- 
ganization than one less hostile. 

It was concluded that (a) relatively un- 
trained observers using an instrument like 
OScAR can develop reliable information 
about differences in classrooms of different 
teachers, (b) that the OScAR technique is 
sensitive to only three of many dimensions
that probably exist, and (c) that
observations made with instruments of this type
can contribute to the solution of many 
important problems having to do with the 
nature of effective teaching.


A FAILURE IN THE PREDICTION OF PUPIL-TEACHER RAPPORT*


WILLIAM RABINOWITZ AND IRA ROSENBAUM 
Division of Teacher Education, Municipal Colleges of New York City 


The essential purpose of this study was 
to determine the success with which sev- 
eral test instruments could predict the 
pupil-teacher rapport achieved by a group 
of teachers. The participating subjects 
took the tests as student-teachers; the cri- 
terion measure of rapport was obtained 
approximately one year later in the class- 
rooms of the same subjects, who were then 
completing their first year of teaching. By 
employing test and criterion measures that 
were clearly separated in time, the study 
attempted to determine the predictive 
validities of the tests for the criterion used. 

Pupil-teacher rapport was measured 
through pupil responses to questions about 
their class and their teacher. The variable 
to be predicted was, therefore, not teacher 
behavior, but pupil reactions to teacher 
behavior. Since it cannot be assumed that 
pupils respond in similar fashion to similar 
teacher behaviors, tests that validly pre- 
dict various aspects of the classroom be- 
havior of teachers might not predict pupil 
responses to such behavior. For this rea- 
son, a number of measures based on the 
teachers’ classroom behavior were included 
in the study as a “bridge” between the 
test measures and criterion measure of 
major interest. 


METHOD 


During the 1953-54 academic year, over 
1600 students who were enrolled in student 
teaching in the four municipal colleges of 
New York City were given a battery of 
tests. Some of the instruments of the
battery were standardized inventories;
others were experimental in nature. The
students took the tests at the beginning
of the student-teaching semester, which
occurred at the end of their senior year.

* This is one of a series of studies of
teacher behavior currently being conducted
by the Office of Research and Evaluation of
the Division of Teacher Education of the
Municipal Colleges of New York City. A
longer version of the present paper may be
had on request as long as the supply remains.

During the academic year 1954-55, a 
follow-up of the student teachers who 
were tested the year before and had 
subsequently received bachelor's degrees
was undertaken. Those students who were 
then teaching in Grades 3 to 6 in New 
York City public elementary schools in 
which at least one other member of the 
group was also teaching were encouraged 
to participate as subjects in an observa- 
tional study. Of approximately 75 teachers 
who met these criteria, it was possible to 
conduct intensive observations in the class- 
rooms of 49. In addition, several tests 
were administered to the pupils taught by 
these 49 teachers and to the teachers 
themselves. 

This report will discuss three kinds of 
data: 

1. Test scores of 49 student-teachers 
obtained during their senior year in col- 
lege. 

2. Classroom behavior records obtained 
through systematic observation approxi- 
mately one year later in the classrooms of 
these 49 former student-teachers. 

3. Scores on pupil-teacher rapport as- 
signed to the 49 teachers on the basis of 
the reactions of their pupils to a paper- 
and-pencil attitudinal measure. 


Test Scores 


From the large group of tests taken by 
the student teachers, the authors selected 
the following tests which, on the basis of 
prior research and educational theory, 
could be expected to function as predictors 
of pupil-teacher rapport.

1. The Minnesota Teacher Attitude
Inventory (MTAI). The MTAI was scored
with two keys: the first, the published, 
empirically-derived key (3), and the sec- 
ond, an experimental key in which the 
items were scored on an a priori, rational 
basis. 

2. The California F Scale. A 30-item 
version of the F scale was developed using 
the item analysis data in The Authoritarian 
Personality (1). 

3. The Draw-a-Teacher Technique 
(DaTt). In the DaTt, a subject is given in- 
structions to “draw a teacher with a class” 
(10). The drawings were scored by three 
scorers along three dimensions—Teacher 
Initiative, Psychological Distance, and 
Traditionalism in Classroom Organization 
(9). Interscorer agreement was estimated 
by analysis of variance procedures; the 
following intraclass correlations were ob- 
tained: 


Teacher Initiative
Psychological Distance .............. .93
Traditionalism in Classroom Organization


4. Sims SCI Occupational Rating Scale 
(SCI). The SCI scale “is an instrument de- 
signed to reveal the level in our social 
structure—i.e. the social class—with which 
a person unconsciously identifies himself” 
(11, p. 1). A subject taking the SCI scale 
indicates whether he generally considers the 
people in each of 42 occupations (repre- 
sentative of varying levels of socioeconomic 
status) as belonging in the same, a higher, or 
a lower social class than he himself does. 

5. Strong Vocational Interest Blank
(Index R). Index R is a 95-item key developed
by Mitzel (8) for the Strong Vocational 
Interest Blank for Women. This key is com- 
posed of those items which successfully 
discriminated high-rapport and low-rapport 
teachers (differentiated on the basis of 
principals’ judgments and MTAI scores) 
and which survived cross-validation (based 
on extreme groups differentiated by the 
MTAI). 

6. Inventory IV: Satisfaction Score.
Inventory IV is an experimental inventory
consisting of 32 multiple-choice items deal- 
ing with student-teaching experiences. It 
is scored to obtain a measure which, on the 
basis of the manifest content of the re- 
sponses, appears to indicate the student- 
teacher’s satisfaction with the student-teach- 
ing experience (2). 


Measures of Classroom Behavior 


A technique for observing and record- 
ing what occurs in a classroom, called 
the Observation Schedule and Record 
(OScAR), was developed to provide a 
means for objectively describing a variety 
of different classroom activities (6). The 
technique provides measures of the fre- 
quency of occurrence of specific classroom 
events, and requires few inferences on the 
part of the observer. 

Each of the 49 teachers was observed by 
six different research workers. They
observed each teacher for two half-hour
periods, for a total of 588
observation periods. No two observers 
visited any given classroom at the same 
time. 

From the basic behavioral data sup- 
plied by the OScAR technique, indices 
of teacher and pupil activities, types of 
pupil groupings, classroom climate, and 
expressive behavior of the teacher were 
derived. Of 14 dimensions developed, the 
following four were selected for study in 
this report because they seemed to be con- 
ceptually related to both the test measures 
and the criterion. 

1. Disorderly Pupil Behavior. This
dimension focuses on pupil behavior which
reflects either hostility or disruptive activity
(e.g., pupil ignores teacher's question,
scuffles, etc.). It is a general index of the
disorder present in a given classroom. Its
reliability, determined by agreement among
observers of the same class on different
occasions, was estimated to be .89.

2. Manifest Teacher Hostility. This
dimension provides an index of the overt,
hostile, nonintegrative activity of the
teacher. Verbal and nonverbal behaviors
judged to reflect teacher hostility (e.g.,
sarcasm and scowling) were tallied and
combined for this dimension. Its reliability
was estimated to be .92.

3. Pupil Leadership Activities. This di- 
mension provides an index of the amount 
of pupil leadership the teacher allows in 
classroom activity. It is based on activities 
in which a pupil addresses, or demonstrates 
to, the class. The reliability of this measure 
was estimated to be .72.

4. Freedom of Movement. This
dimension offers an index of the freedom of
movement exhibited by both pupil and teacher
in the classroom. It reflects the teacher's
apparent willingness to circulate among the
pupils, and the ease with which a pupil can
move about without requiring special
permission. Its reliability was estimated to be
.63.


Three other measures of the classroom 
were derived from global ratings of the 
classroom setting. The observers consulted 
the drawing scales developed and em- 
ployed as part of the Draw-a-Teacher 
technique described earlier. After
completing an observation period, the observer
rated the class on each of the following 
dimensions: Teacher Initiative, Psycho- 
logical Distance, and Traditionalism in 
Classroom Organization. The reliabilities 
of these ratings were estimated to be .72, 
.71, and .85, respectively.


Measure of Pupil-Teacher Rapport 


In the present study, pupil-teacher 
rapport was defined as the generalized, 
conscious, subjective regard expressed by 
pupils for their teacher. In order to 
secure measures of the way in which the 
pupils perceived their teacher, an inven- 
tory, My Class, was constructed (5). 
This inventory consists of 47 scored items 
comprising four scales: Halo, Disorder, 
Supportive Behavior, and Traditionalism. 
The Halo scale is designed to indicate the 
extent to which the pupils have a general 
feeling of liking for the teacher, while 
the other three scales are intended to 
measure fairly specific teacher and pupil 
behaviors. 

My Class was administered to all the 
pupils in the classes of the 49 teachers 
participating in the study. The items 
were read aloud to the pupils by a test 
administrator, while the teacher sat at the 
back of the room and filled out an inven- 
tory unrelated to the pupils’ activity. 
The proportion of the class giving the 
keyed response to each item was used as
the teacher’s score on that item. A
teacher's score on each scale was the sum 
of proportions for all of the items of that 
scale, appropriately weighted plus or 
minus. 
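
The scoring rule just stated (proportion of the class giving the keyed response, summed over items with plus or minus weights) can be sketched as follows. The item labels, the keyed responses, and the signs shown are hypothetical; the actual scoring key of My Class is not reproduced in this report.

```python
def item_proportion(responses, keyed_response):
    """Proportion of pupils in one class giving the keyed response to an item."""
    matches = sum(1 for response in responses if response == keyed_response)
    return matches / len(responses)


def teacher_scale_score(class_responses, scoring_key):
    """Sum of item proportions, each weighted +1 or -1.

    class_responses: {item label: list of pupil responses}
    scoring_key:     {item label: (keyed response, weight)}
    """
    total = 0.0
    for item, (keyed_response, weight) in scoring_key.items():
        total += weight * item_proportion(class_responses[item], keyed_response)
    return total


# Hypothetical two-item key and a class of four pupils.
example_key = {"likes_class": ("yes", +1), "stays_away": ("yes", -1)}
example_responses = {
    "likes_class": ["yes", "yes", "yes", "no"],
    "stays_away": ["no", "yes", "no", "no"],
}
example_score = teacher_scale_score(example_responses, example_key)
```

Because the unit of scoring is the class proportion rather than the individual pupil, each teacher receives a single score per scale regardless of class size.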

The Halo scale consists of the following 
eight items scattered throughout the My 
Class inventory: 

1. Do you ever feel like staying away 
from school? 

2. Do you like to be in this class? 

3. Do you have much fun in this class? 

4. Do you learn a lot in this class? 

5. Are you proud to be in this class? 

6. Do you always do your best in this 
class? 

7. Do most of the pupils like the 
teacher? 

8. Does the teacher help you enough?

The reliability of this scale, estimated by
analysis of variance procedures, was .89. 


RESULTS


For each teacher in this study, there 
were 17 measures:* nine test scores, seven 
classroom observation measures, and one 
measure of pupil-teacher rapport based on 
pupil reactions. The primary analysis of 
these data consisted of correlating each 
of these measures with the other 16. 
The resulting correlations are contained 
in Table 1. From an examination of Table 
1 it is clear that: 

1. None of the tests correlates sig- 
nificantly with the measure of pupil- 
teacher rapport. 

2. None of the 63 correlations between 
the test variables and the classroom be- 
havior variables is significant except that 
between the Teacher Initiative score on 
the DaTt and the OScAR dimension, 
Freedom of Movement. 

3. The only classroom behavior variable
that correlates significantly with the Halo
score of My Class is the OScAR dimension
Manifest Teacher Hostility. The
correlation is in the “expected” direction,
i.e., the pupils’ liking for their teacher as
indicated on My Class decreases with the
amount of manifest teacher hostility
recorded by observers. Evidently the more
hostility a teacher displays in the
classroom, the less esteemed she is by her
pupils.

* This is not strictly speaking the case,
since complete test data were not available
for every one of the 49 subjects. See
footnote a on Table 1.
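
The counts of correlations referred to in these results follow directly from the numbers of measures in each group, as the short tally below shows. The variable names are placeholders invented for the illustration; only the group sizes (nine, seven, and one) come from the text.

```python
from itertools import combinations, product

# Placeholder labels; only the group sizes are taken from the report.
tests = [f"test_{i}" for i in range(1, 10)]          # nine test scores
behaviors = [f"behavior_{i}" for i in range(1, 8)]   # seven classroom measures
criterion = ["halo"]                                 # one rapport measure

measures = tests + behaviors + criterion

# Every distinct pair of the 17 measures.
all_pairs = list(combinations(measures, 2))
# Pairs of one test score with one classroom behavior measure.
test_behavior_pairs = list(product(tests, behaviors))
# Pairs of one test score with the rapport criterion.
test_criterion_pairs = list(product(tests, criterion))
```

With 17 measures there are 136 distinct correlations in all, of which 63 relate a test to a classroom behavior measure and 9 relate a test to the rapport criterion.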

A multiple regression analysis was em- 
ployed using the pupil-teacher rapport 
criterion with all of the test variables ex- 
cept the MTAI-Rational Key score as 
independent variables. When weighted 
optimally with the partial regression
coefficients, the eight test scores correlated
.496 with Halo. This multiple correlation
coefficient is not significant. 
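
A rough check on this statement can be made with the conventional F ratio for a multiple correlation, F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below assumes that all 49 teachers contributed complete data, which the footnote to these results indicates is not strictly so, and it compares the result with a critical value of about 2.18 for 8 and 40 degrees of freedom at the .05 level, a figure taken from standard F tables rather than from this report. The function name is hypothetical.

```python
def f_ratio_for_multiple_r(multiple_r, n_cases, n_predictors):
    """F ratio for testing a multiple correlation against zero."""
    r_squared = multiple_r ** 2
    df_error = n_cases - n_predictors - 1
    return (r_squared / n_predictors) / ((1 - r_squared) / df_error)


# R = .496 with eight predictors; n = 49 is an assumption (see lead-in).
f_ratio = f_ratio_for_multiple_r(0.496, n_cases=49, n_predictors=8)

# Approximate tabled critical value for F(8, 40) at the .05 level (assumption).
CRITICAL_F_05 = 2.18
is_significant = f_ratio > CRITICAL_F_05
```

Under these assumptions the F ratio is roughly 1.6, below the tabled value, which agrees with the conclusion stated above. A smaller effective sample would lower the ratio further.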


DISCUSSION


The major finding of this study is the 
failure of the tests, singly or in combina- 
tion with one another, to predict subse- 
quent pupil-teacher rapport as measured 
by the Halo scale. Each of the tests was 
selected for study because theory or past 
research, and sometimes both, encouraged 
its use as a potential predictor. The fact 
that none of the tests adequately func- 
tioned to predict pupil-teacher rapport is 
therefore of particular interest. 

One of the distinguishing features of 
this study is that the tests were ad- 
ministered to a group of college seniors 
who had not yet served as teachers. Since 
the tests and criterion were well separated 
in time, the study deals with the pre- 
dictive, rather than concurrent, validities 
of the tests employed. In general, the re- 
sults offer no evidence of the predictive 
validity of any of the tests for the par- 
ticular criterion measure studied. The tests 
not only failed to predict rapport, they 
did not correlate with the objective meas- 
ures of behavior in the classroom. Of the 
63 correlations between test variables and 
classroom behavior variables, only the
relationship between the Teacher Initiative
dimension of the DaTt and the Freedom 
of Movement dimension of OScAR proved 
significant. 

The fact that a test has concurrent
validity is often incorrectly used to sup- 
port a recommendation for its use as a 
predictive measure. Thus, the MTAI has 
been shown to correlate with various in- 
dependent measures of pupil-teacher rap- 
port (3), including measures based on 
pupil responses to questions such as were 
used in My Class. The well-established 
concurrent validity of the MTAI does 
not, however, demonstrate that the test 
is of predictive value, and the recommen- 
dation contained in the test manual that 
the inventory be used as a predictor is ac- 
cordingly without empirical support. The 
evidence of the present investigation, 
which is the only published research of 
which the writers are aware involving the 
correlation of the MTAI and a subse- 
quent measure of pupil-teacher rapport, 
would argue strongly against its use as a 
predictive instrument. 

It may be important to note that the 
pupil-teacher rapport criterion used in 
this study was not so uniquely or com- 
pletely determined by the personalities 
of the pupils who responded to My Class
as to be unrelated to measurable, be- 
havioral variables in the teacher. As Table 
1 indicates, one of the measures of the 
teachers’ classroom behavior, Manifest 
Teacher Hostility, correlates significantly 
with the criterion. Moreover, in a pre- 
viously reported investigation, Medley 
and Williams (7) found that the Halo 
scores of the 49 teachers in this study 
correlated significantly (r = +.34) with 
their scores on a concurrent test measure 
of hostility.* Since the criterion used in




* It is of interest to note that the Hostility 
scale was built by selecting 50 items from 
among those on the Minnesota Multiphasic 
Personality Inventory that were found to 
discriminate significantly between teachers 
this study is correlated with a concurrent
measure of the teachers’ classroom be- 
havior and test behavior, the failure of 
the predictive instruments cannot be at- 
tributed to inherent unpredictability of 
the criterion. 
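
As a rough check on the correlation reported above (r = +.34 for 49 teachers), the t statistic for a correlation coefficient can be worked out directly. This is only a sketch: the two-tailed .05 level is an assumption, since the text does not state the significance level used, and `t_from_r` and `T_CRIT_05` are illustrative names.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing a correlation r from n pairs against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Approximate two-tailed .05 critical value of t for 47 degrees of freedom.
# The .05 level is an assumption; the source does not state it.
T_CRIT_05 = 2.012

t_value = t_from_r(0.34, 49)
print(f"t = {t_value:.2f} on 47 df; exceeds critical value: {t_value > T_CRIT_05}")
```

Under these assumptions the reported coefficient clears the critical value, which agrees with the statement that the correlation was significant.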

In the past, demonstrations of the con- 
current validity of tests have, too often, 
been uncritically accepted as evidence of 
their predictive value. The study reported 
here, however, adds support to the grow- 
ing view that the predictive value of tests 
can only be established through predic- 
tive studies. 


SUMMARY


A large group of student teachers were 
given a number of personality and at- 
titude tests during their senior year in 
college. Observations were conducted ap- 
proximately one year later in the rooms 
of 49 of these subjects who were employed 
as elementary school teachers. A measure 
of pupil-teacher rapport based on pupil 
responses to questions about their teacher 
and their class was also obtained. 

In general, none of the test measures 
correlated significantly with pupil-teacher 
rapport as measured. Only one of the 63 
correlations between the test measures 
and classroom behavior measures proved 
significant. Manifest Teacher Hostility, a 
measure based on classroom observation of 
the teacher, correlated significantly with
rapport. 

The implications of these results for 
the prediction of pupil-teacher rapport 
were discussed. 


scoring high and low on the MTAI (4). In 
view of the manner in which the Hostility 
scale was developed, it is difficult to deter- 
mine why it should correlate significantly 
with the Halo scale while the MTAI does 
not. Only when the temporal relations of 
the Hostility scale and the MTAI to the 
criterion are fully appreciated does the dif- 
ference in the validity coefficients become 
understandable. 
When the so-called readability formulas
are used only as rough estimating devices 
for the encouragement of popular writing, 
statistical precision is not vitally impor- 
tant. But if they are to be considered re- 
search tools in studies of comprehension or 
learning, it becomes very important to 
build into them as much precision as pos- 
sible. 

Current readability formulas offer at 
least two opportunities for reexamination 
for the sake of greater precision. First, 
many are based on reading comprehension 
tests published in 1926 and drawn from 
empirical testing of school pupils prior 
to that date. Thus they may not ade- 
quately reflect changes in either the lan- 
guage or the population of the present 
decade. Second, the “ratings” produced by 
present tests are not accompanied by a 
standard error figure, and hence tell 
nothing about significance of estimates and 
differences. 

The revision of the set of graded test 
passages used in building two widely used 
readability formulas—the Flesch Reading 
Ease Formula (3) and the Dale-Chall 
readability formula (1)—has offered an 
opportunity for revision of the formulas 
and also for further comparative evalua- 
tion of these two indexes of comprehen- 
sion difficulty. The two formulas were 
originally calculated, following Lorge (6), 
by making measurements of sentence 
length and vocabulary difficulty in the 
1926 edition of the McCall-Crabbs Graded 
Test Lessons in Reading (7). Both for- 
mulas make use of the same sentence 
length measure—average number of words 
per sentence. For a vocabulary measure- 
ment, Flesch uses the number of syllables 
per 100 words, while Dale and Chall count 
the number of words that do not appear 


on a list of 3,000 words which had proved 
“familiar” for youngsters tested in the 
fourth grade of public schools. 

Results with the two formulas are not 
directly comparable for several reasons. 
Scores are given in different terms—Flesch 
results on a scale of 100 (easy) to 0 
(difficult) and Dale-Chall results on a 
scale of about 3 (easy) to 14 (difficult). 
Flesch’s formula was calculated with 
grouped data, while Dale and Chall com- 
puted theirs with ungrouped data. In 
addition, Flesch made an adjustment in 
one of the formula terms after computation,
while Dale and Chall did not.*

More recent arrivals on the readability 
scene are the Farr-Jenkins-Paterson sim- 
plification of the Flesch Reading Ease 
formula (2) and the Gunning Fog Index 
(4). The former uses a count of percentage 
of monosyllables instead of the Flesch 
syllable count, while the latter uses a count 
of polysyllables (words of more than two 
syllables). Both formulas take sentence 
length into account. Both are viewed by 
many as simplifications of the Flesch for- 
mula. 

The McCall-Crabbs tests were revised
considerably in 1950 (8). There is evi- 
dence that the questions and passages of 
the 1926 edition were changed considerably 
in the 1950 edition. At least 60 of the 
tests in the 1950 edition are different in 
subject from those in the earlier edition. 


* This adjustment concerned the criterion 
value used in developing the regression 
equation. Both Flesch and Dale-Chall used 
as a criterion the average school grade of 
pupils answering correctly 50% of the ques- 
tions accompanying the reading passages. 
But Flesch adjusted the formula to predict 
the grade of the pupil who could answer 
75% of the questions correctly. This changed 
the regression formula constant. 
TABLE 1
AVERAGES AND STANDARD DEVIATIONS OF MEASUREMENTS IN TWO EDITIONS OF
THE McCALL-CRABBS STANDARD TEST LESSONS IN READING

                                                     1926 Edition
Measurement                     1950 Edition     Flesch           Dale-Chall
------------------------------------------------------------------------------
Mean average grade of pupils    4.9862           5.4973           5.7492
  answering 50% correctly       (s = 1.1068)     (s = 1.3877)     (s = 1.6565)
  (criterion)

Average number of words         15.3986          16.5213          16.8037
  per sentence                  (s = ...)        (s = 5.5509)     (s = 5.3818)

Average syllables per           131.6131         134.2208         -
  hundred words                 (s = 11.830)     (s = 13.6845)

Average percentage of words     6.9413           -                5.7603
  not on Dale list              (s = 5.8200)                      (s = ...)

Average percentage              75.1148          -                -
  monosyllables                 (s = 6.8083)

Average percentage              ...              -                -
  polysyllables

Note. Entries shown as ... are illegible in the source.


Those are the passages dealing with World 
War II, atomic energy, modern aviation, 
and similar recent developments. Table 1 
shows how the two editions differed in 
averages and standard deviations of the 
various measurements. 


PURPOSE OF REVISION


It was felt that recalculation of these 
four formulas with the 1950 tests as a 
criterion would accomplish two main purposes:
(a) modernize the formulas by
taking advantage of the more recently ad- 
ministered tests which should reflect some 
of the changes in pupil reading abilities 
between 1926 and 1950, and (b) establish 
formulas which are derived from identical 
materials, measured by identical rules of 
measurement on the common factor, cal- 
culated by identical mathematical opera- 
tions, and reported without adjustment. 
The latter goal seems desirable because it 
will make further comparative studies 
easier to perform and interpret (i.e., no 


R. D. POWERS, W. A. SUMNER, B. E. KEARL 


manipulations of the recalculated formulas 
will be needed in future research toward 
modernization and validation). It would 
also allow averaging of several formula re- 
sults for any sample of writing, thus per- 
haps giving more accurate scores where 
extreme accuracy is needed. 


METHODS 


The following measurements were made 
in the 383 prose passages of the 1950 edi- 
tion of the McCall-Crabbs tests: 

1. Average grade score of pupils an- 
swering half the test questions correctly. 

2. Average number of words per sen- 
tence in each passage. 

3. Number of syllables per 100 words 
in each passage. 

4. Percentage of words in each passage 
not appearing on Dale’s list of 3,000 “easy” 
words. 

5. Percentage of monosyllables in each 
passage. 

6. Percentage of polysyllables in each 
passage. 
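
Style measurements 2, 3, 5, and 6 above can be sketched in a few lines. The block below is illustrative only: the syllable counter is a crude vowel-run heuristic, not the counting rules used in the study, and the function names are hypothetical. Measurement 4 is omitted because it requires Dale's list of 3,000 words, which is not reproduced here.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, y included."""
    runs = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(runs))


def style_measures(text: str) -> dict:
    """Sentence-length and syllable measures for one prose passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    n = len(words)
    return {
        "words_per_sentence": n / len(sentences),
        "syllables_per_100_words": 100 * sum(syllables) / n,
        "pct_monosyllables": 100 * sum(s == 1 for s in syllables) / n,
        # Polysyllables are taken as words of more than two syllables.
        "pct_polysyllables": 100 * sum(s > 2 for s in syllables) / n,
    }


print(style_measures("The cat sat. The dog ran away."))
```

A heuristic of this kind will miscount words with silent letters, so it is suited to rough estimates only.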

Regression formulas were computed 
with these measurements,* and the results
of the formulas were compared by ap- 
plying them to 113 samples of writing 
from various publications to determine the 
practical significance of differences in for- 
mula results. The recalculated Flesch and 
Dale-Chall formulas were also compared
with each other and with results from 
the original formulas* in a sample of 40 
of the McCall-Crabbs passages. Such com- 
parisons with the other recalculated for- 

* Calculation facilities used were in the 
Wisconsin Numerical Research Laboratory, 
supported by a National Science Foundation 
grant and funds from the Wisconsin Alumni 
Research Foundation allocated by the Uni- 
versity of Wisconsin Graduate School Re- 
search Committee. 

*The original formulas were as follows:
Flesch: 206.84 - (1.015)(sent. length) -
(.846)(syllables per 100 words). Dale-Chall:
3.6365 + (.0496)(sent. length) + (.1579)(%
non-Dale words). Gunning: .4 (sent.
length + % polysyllables). F-J-P: -31.517 -
(1.015)(sent. length) + (1.599)(% mono-
mulas were not possible because adjustments
of the original formulas were not
possible or because different rules for 
word counting were used in the original 
formula and the recalculation. 


RESULTS 


The calculations yielded the following 
recalculated formulas: 


Flesch: -2.2029 + (.0778)(sentence length) + (.0455)(syllables per 100 words)

Dale-Chall: 3.2672 + (.0596)(sentence length) + (.1155)(% non-Dale words)

Farr-Jenkins-Paterson: 8.4335 + (.0923)(sentence length) - (.0648)(% monosyllables)

Gunning Fog Index: 3.0680 + (.0877)(sentence length) + (.0984)(% polysyllables)
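
The four recalculated formulas can be written as simple functions returning a predicted school grade. The sketch below takes the sentence-length term of the Farr-Jenkins-Paterson formula as positive and its monosyllable term as negative; the function names are illustrative. As a sanity check, it evaluates three of the formulas at what appear to be the 1950-edition means in Table 1 (15.3986 words per sentence, 131.6131 syllables per 100 words, 6.9413% non-Dale words, 75.1148% monosyllables), reading the first value in each row as the 1950 edition, which is an assumption. A least squares line passes through the means, so each result should fall near the mean criterion grade of 4.9862.

```python
def flesch_recalc(words_per_sentence: float, syllables_per_100_words: float) -> float:
    """Recalculated Flesch formula (predicted grade level)."""
    return -2.2029 + 0.0778 * words_per_sentence + 0.0455 * syllables_per_100_words


def dale_chall_recalc(words_per_sentence: float, pct_non_dale_words: float) -> float:
    """Recalculated Dale-Chall formula (predicted grade level)."""
    return 3.2672 + 0.0596 * words_per_sentence + 0.1155 * pct_non_dale_words


def fjp_recalc(words_per_sentence: float, pct_monosyllables: float) -> float:
    """Recalculated Farr-Jenkins-Paterson formula (predicted grade level)."""
    return 8.4335 + 0.0923 * words_per_sentence - 0.0648 * pct_monosyllables


def gunning_recalc(words_per_sentence: float, pct_polysyllables: float) -> float:
    """Recalculated Gunning Fog Index (predicted grade level)."""
    return 3.0680 + 0.0877 * words_per_sentence + 0.0984 * pct_polysyllables


# Values read from Table 1, assumed to be the 1950-edition means.
MEAN_SENTENCE_LENGTH = 15.3986
print(round(flesch_recalc(MEAN_SENTENCE_LENGTH, 131.6131), 2))
print(round(dale_chall_recalc(MEAN_SENTENCE_LENGTH, 6.9413), 2))
print(round(fjp_recalc(MEAN_SENTENCE_LENGTH, 75.1148), 2))
```

The Gunning formula is not checked in this way because the mean percentage of polysyllables is not legible in Table 1.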


The coefficients of multiple determination
(R²), which indicate the amount of
variation in difficulty among the tests
that is accounted for by the two style
variables in the formula, are .4034 for
the recalculated Flesch formula, .5092 for 
the recalculated Dale-Chall formula, .3407 
for the Farr-Jenkins-Paterson recalcula- 
tion, and .3440 for the Gunning recalcula- 
tion. These statistics, which are corrected 
for degrees of freedom, show that the re- 
calculated Flesch formula statistically 
“explains” some 40% of the variation in 
difficulty of the McCall-Crabbs tests. The 
Dale-Chall formula explains almost 51%, 





syllables). The Flesch formula used for
comparison with the new one had to be
adjusted back to predict at the 50% level
of the criterion and the scale reversed (i.e.,
changed back to the form which was presumably
yielded directly by Flesch's computations
before he made the various adjustments).
This unscaled, reversed formula, as
nearly as we can determine, is -7.5695 +
(.1015)(sent. length) + (.0846)(syllables per
100 words). It was not possible to put the
Farr-Jenkins-Paterson simplification on such 
a basis. 
and is thus the much more powerful tool
for predicting reading difficulty. The Farr- 
Jenkins-Paterson and Gunning formulas 
as recalculated are about equal in predic- 
tive power—both considerably weaker 
than the other formulas. 

The error terms for the formulas are .85 
school grades for the Flesch formula, .77 
grades for the Dale-Chall formula, and 
.90 grades for the others. Converting the 
predicted value for each formula into a 
grade level figure and following the stand- 
ard practice of taking a range of plus or 
minus two standard errors as the probable 
area in which the “true” value lies, the 
error range would be 1.71 grades for the 
Flesch formula, 1.55 grades for the Dale- 
Chall formula, and 1.80 grades for both 
the others. Thus the Dale-Chall formula 
came through the recalculations as slightly 
more precise than the others. 
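
The error terms quoted above can be approximately reproduced from the reported R² values, assuming that each error term is the usual standard error of estimate, s·sqrt(1 - R²), and taking s = 1.1068 (the first criterion standard deviation in Table 1) as the value for the 1950 edition. Both points are assumptions rather than steps described in the text, and the names below are illustrative.

```python
import math

# Criterion standard deviation read from Table 1, assumed to be the 1950 edition.
CRITERION_SD_1950 = 1.1068

# Coefficients of multiple determination reported for the recalculated formulas.
R_SQUARED = {
    "Flesch": 0.4034,
    "Dale-Chall": 0.5092,
    "Farr-Jenkins-Paterson": 0.3407,
    "Gunning": 0.3440,
}


def standard_error_of_estimate(r_squared: float, sd: float = CRITERION_SD_1950) -> float:
    """Standard error of estimate in school grades: sd * sqrt(1 - R^2)."""
    return sd * math.sqrt(1.0 - r_squared)


def error_range(r_squared: float) -> float:
    """Two standard errors, the plus-or-minus range discussed in the text."""
    return 2.0 * standard_error_of_estimate(r_squared)


for name, r2 in R_SQUARED.items():
    se = standard_error_of_estimate(r2)
    print(f"{name}: standard error {se:.3f}, two standard errors {2 * se:.3f}")
```

Under these assumptions the computed ranges agree with the 1.71, 1.55, and 1.80 grades given above to within rounding.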

Table 2 presents comparisons of various 
statistics for the Flesch formula and Dale- 
Chall formula in their recalculated and 
original forms and for the recalculated 
Farr-Jenkins-Paterson simplification and 
Gunning Index. 

To assess the practical significance of 
the revision, the original and recalculated 
forms of the Flesch and Dale-Chall for- 
mulas were applied to 47 sample passages 
from a variety of sources. The recalculated 
Dale-Chall formula consistently gave lower 
scores than the original; the average dis- 
crepancy (average absolute deviation) be- 
tween the two was .94 grades. The average 
discrepancy between the original and recalculated
Flesch formulas was .85 grades,
with the recalculated formula giving a 
lower score about two-thirds of the time. 

All four recalculated formulas were com- 
pared in a sample of 113 passages from 
15 magazines. The results are given in 
Table 3. The writers feel that two observa- 
tions from the table are worthy of mention 
here. First, the average discrepancy of re- 
sults using the recalculated Flesch and 
Dale-Chall formulas was .54 school grades. 
TABLE 2
THE RECALCULATED AND ORIGINAL FORMULAS: REGRESSION STATISTICS

               Flesch Formula           Dale-Chall Formula       F-J-P          Gunning Index
Statistic(a)   Recalculated  Original   Recalculated  Original   Recalculated   Recalculated
-----------------------------------------------------------------------------------------------
r²(12)         .2019         .2695      .2019         .2191      .2019          .2019
r²(13)         .3436         .4420      .4759         .4670      .2526          ...
r²(23)         .1363         .2157      .1599         .2607      .1055          .1265
r²(12.3)       .0987         .1743+     .0450         .1331+     .1146          ...
r²(13.2)       .3117         .2202+     .3883         .1936+     .6293          ...
b(12.3)        .0778         .1015      .0596         .0496      .0923          .0877
b(13.2)        .0455         .0846      .1155         .1579      -.0648         .0984
β(12.3)        .2697         .2639      .2065         .1611      .3199          ...
β(13.2)        .4865         .5422      .6073         .6011      -.3986         ...
a(1.23)        -2.2029       -7.5695    3.2790        3.6365     8.4335         3.0680
R²(1.23)       .4034         .4966      .5092         .4900      .3407          .3440

Note. + Values computed from those given by Flesch or Dale and Chall by the relationship:
$r_{ij.k} = \dfrac{r_{ij} - r_{ik} r_{jk}}{\sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}}$
Entries shown as ... are illegible in the source.
(a) Subscripts refer to (1) the criterion, (2) average sentence length. Subscript (3) refers to a different variable for
each formula: syllables per 100 words for the Flesch formula, percentage of words not on the Dale list for the Dale-Chall formula,
percentage of monosyllables for the Farr-Jenkins-Paterson formula, and percentage of polysyllables for the Gunning
index.

In the comparison of the original Dale-
Chall and Flesch formulas above, the average
discrepancy was .87 grades. All four
recalculated formulas agreed much more
closely with one another than the original
Dale-Chall and Flesch formulas did. This
would seem to be a point in favor of the
recalculations.


TABLE 3
COMPARISONS BETWEEN RESULTS WITH RECALCULATED FORMULAS APPLIED TO
113 100-WORD SAMPLES OF PROSE

                           Positive      Negative      Average
                           deviations    deviations    absolute
Formulas compared          (number)      (number)      deviation
-----------------------------------------------------------------
Dale-Chall and Flesch      58            55            .54
Flesch and Gunning         82            31            .44
Dale-Chall and Gunning     82            29            ...
Flesch and F-J-P           96            16            .50
Dale-Chall and F-J-P       93            20            ...
Gunning and F-J-P          70            42            ...

Note. Entries shown as ... are illegible in the source, as are the
average sizes of the positive and negative deviations.


The recalculated Gunning Index gave
results that were in slightly higher agreement
with the results of the recalculated
Flesch formula than were the results with 
the recalculated Farr-Jenkins-Paterson 
simplification. The average absolute devia- 
tion between the recalculated Flesch for- 
mula and Gunning Index was .44 grades, 
with 73% of the predictions lower than 
those of the Flesch formula. The average 
absolute deviation between the recalcu- 
lated Flesch formula and the recalculated 
Farr-Jenkins-Paterson simplification was 
.50 grades, with 85% of the results with the
simplification being lower than results with 
the Flesch formula. Thus the two simplifi- 
cations gave slightly lower scores than 
the recalculated Flesch formula. Scores 
with the recalculated Gunning Index were 
slightly closer to the Flesch results, and 
there were more instances of predictions 
which were higher than the Flesch predic- 
tions than was true of the Farr-Jenkins- 
Paterson formula as recalculated. 
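
The comparison statistics used here (counts of positive and negative deviations, and the average absolute deviation) are straightforward to compute for any two sets of formula scores. The sketch below uses made-up scores and a hypothetical function name; only the counts 82 and 96 out of 113 are taken from Table 3, to confirm that they correspond to the 73% and 85% quoted in the text.

```python
def compare_formulas(scores_a, scores_b):
    """Summarize deviations (a minus b) between two formulas on the same passages."""
    deviations = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "positive": sum(d > 0 for d in deviations),
        "negative": sum(d < 0 for d in deviations),
        "average_absolute_deviation": sum(abs(d) for d in deviations) / len(deviations),
    }


# Hypothetical grade-level scores for three passages (not data from the study).
summary = compare_formulas([5.0, 6.0, 7.0], [5.5, 5.0, 7.0])
print(summary)

# Counts from Table 3 expressed as percentages of the 113 passages.
print(round(100 * 82 / 113), round(100 * 96 / 113))
```

Ties count as neither positive nor negative, which would explain why some pairs of counts in Table 3 sum to fewer than 113.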
TABLE 4
"NORMS" OF RECALCULATED FORMULA SCORES FOR MATERIAL OF VARIOUS TYPES

Columns: Material; Flesch; Dale-Chall; Gunning; Farr-Jenkins-Paterson
(means and ranges for each type of material).

Scientific (Difficult): Phytopathology, Soil Science, Journal of Nutrition,
  Science, American Journal of Veterinary Research
Academic (Difficult): Yale Review, Harvard Educational Review, Annals of the
  American Academy of Political and Social Science
Quality (Fairly Difficult): Harper's, Atlantic Monthly
Standard (Average): Reader's Digest
Slick Fiction (Fairly Easy): Collier's, Ladies' Home Journal, Good Housekeeping
Pulp Fiction (Easy): True Confessions

Note. Most of the means and ranges are illegible in the source and cannot be
assigned to rows. Legible fragments: means 8.00, 7.70; ranges 7.10-9.50,
7.10-10.70, 6.70-8.90; means 7.90, 8.40; ranges 7.00-8.70, 7.50-8.60;
range 3.80-4.80.

The application of the formulas to the 
113 passages also provides some “norms” 
for interpreting scores which they yield. 
The passages came from various types of 
publications, presumably representing gen- 
erally different levels of reading difficulty 
as noted in Table 4. 

Use of such a scale is admittedly a rough 
manner of interpreting readability scores, 
and the scale in Table 4 was not formed in 
the most exact manner; although sampling 
was at random within issues, the issues 
were not randomly chosen and calcula- 
tions were rounded. However, this gen- 
eral approach seems more desirable than 
using the theoretical formula result (grade 
level) or making adjustments in the theo- 
retical result without benefit of extended 
testing. Further details on background, 


method, and results of this work are avail- 
able on microfilm (9). 


CONCLUSIONS AND DISCUSSION


To recommend use of the four recalcu- 
lated formulas in preference to the origi- 
nal ones is a rather drastic step, in view 
of the wide use the original formulas have 
enjoyed. However, such a recommendation 
is made here for the reasons we set forth 
in the paragraphs on the purpose of the 
recalculation. 

The formula coefficients derived in the 
recalculations on the 1950 McCall-Crabbs 
tests have the same statistical validity as 
those calculated on the 1926 edition of the 
tests. They are statistically preferable to 
those formed by rougher, short-cut pro- 
cedures. 
Reservations in making a recommendation
to use the recalculated Dale-Chall and
Flesch formulas stem from two basic 
sources: (a) Readability formulas are such 
rough estimates at best that to say one 
result is better than another is statistically 
hazardous—especially when the nature of 
the material on which the formulas are to 
be used differs from that of the material 
used in computing the formula. (b) In
the revision of the McCall-Crabbs criterion 
tests, passages of higher difficulty were 
omitted. The style measurements of these 
passages and the educational level of pu- 
pils taking these more difficult passages 
might have been of a type which more 
nearly approaches the type of writing and 
audience for which the formulas are nor- 
mally used. In other words, restriction of 
the range of difficulty in the 1950 tests may 
have made this edition less suitable than 
the 1926 tests for building readability for- 
mulas. But to the extent this argument 
is sound, all linear formulas suffer equally 
from the curvilinearity it implies. 

It is further recommended that the 
Dale-Chall formula be used whenever pos- 
sible in the absence of specific reasons for 
preferring the Flesch formula or one of its 
simplifications. The Dale-Chall formula 
was best in terms of small error and high 
prediction power. This parallels an earlier 
judgment by Klare (5) that the original 
Dale-Chall formula was better than the 
original Flesch formula by a slight margin. 

The statements here as to error and pre- 
diction power of the formulas apply only 
to prediction and precision in regard to 
the criterion passages. They do not un- 
equivocally hold true for the formulas as 
they are normally used—for estimating 
difficulty of adult reading materials. It is 
possible that a formula with low precision 
or predictive power in this research could 
be fully as precise as the others for pre- 
dicting adult reading difficulty. But there 
is no direct evidence that this would be so, 
and the only recourse at present seems to 
be to give the Dale-Chall formula the
highest place on the basis of its prediction 
power and small error computed on the 
criterion. 

Some formula-users—particularly those 
who use formulas only occasionally—are 
understandably reluctant about referring
to a word list, which is required by the 
Dale-Chall formula. Of popular formulas 
without word lists, the Flesch formula is 
statistically best. 

Those who use either simplification of 
the Flesch formula should recognize that 
they are sacrificing precision and accuracy 
by doing so. But it seems evident that for 
estimates of readability which need to be 
performed rapidly and where precision is 
not extremely important, either simplifica- 
tion will do the job. 

There are two ways of looking at pre- 
cision in a readability formula. One way 
is to admit that formulas are rough esti- 
mates at best, and that a loss of a little 
precision is not important. The other is 
to argue that since the formulas give only 
rough estimates, it is important to keep 
whatever precision and prediction power 
exists. 

The choice of viewpoint seems to hinge 
on the use to be made of formula results. 
A news writer or editor who uses a formula 
“to see how we are doing” could probably 
regard all four formulas as equal for his 
purpose and use whichever formula he 
found easiest to apply. If readability scores 
are part of a research design, however, the 
social scientist will want to choose the most 
powerful and precise formula even though 
it entails more difficulties in application. 
In the March 1955 issue of this journal
the authors, together with Knief, presented 
data showing that the use of height age, 
weight age, and dental age contributed 
practically nothing to a least squares 
estimate of either reading or arithmetic 
achievement when combined with mental 
age (1). That is to say, mental age alone 
provided about as accurate an estimate 
of achievement in these areas as a least 
squares combination of mental age and 
these three other age scores. 

It is well-known that for the model 
assumed (usually a first degree poly- 
nomial) a least squares combination of 
scores provides composite scores for the 
individuals at hand which bear a maxi- 
mum degree of relationship to the cri- 
terion. Hence, multiple correlations be- 
tween achievement and the component 
variables that enter into organismic age 
(OA) cannot be lower than the simple
correlations between achievement and 
mental age (MA) alone. In such multiple 
correlation analyses the component age 
scores are, of course, automatically ideally 
weighted before being combined. In the 
formation of the OA score, on the other 
hand, the component age scores are given 
equal weights* since the OA score is the
simple unweighted average of the com- 
ponent age scores. Because of the nature 
of the relationships among these age scores 
it follows that the correlation between 
educational achievement and OA must 
necessarily be considerably lower than that 
between achievement and MA alone. The 
purpose of this note is to demonstrate 


*By a weight we mean the constant by 
which the score for a trait is multiplied be- 
fore it is combined with other similarly 
weighted scores to form a composite. 


this fact analytically. We shall also use 
data reported in our previous article to 
illustrate the extent of this attenuating 
effect of anatomical and physiological age 
scores when used in combination with MA 
to predict school achievement. 

In the following discussion and in keep- 
ing with our previous article, we shall 
consider as estimators only mental, height, 
weight, and dental age scores. We shall 
designate these as M, H, W, and D, re- 
spectively. Reading and arithmetic scores 
will be used as measures of school achieve- 
ment and will be designated R and A, 
respectively. Since the correlation between 
the sum of the four age scores and either 
R or A is identical with the correlation 
between the mean of these four scores 
and R or A, we shall discuss the efficacy 
of OA as a predictor of R or A where 
OA = O = M + H + W + D. The 
symbol Cov RO will be used to refer to 
the covariance between R and O scores 
while the symbol Var O will be used to 
refer to the variance of the O scores. 

Consider reading (R) as the criterion. 
Then 


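$$\operatorname{Cov} RO = \operatorname{Cov} R(M + H + W + D) = \operatorname{Cov} RM + \operatorname{Cov} RH + \operatorname{Cov} RW + \operatorname{Cov} RD. \qquad [1]$$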


But the various anatomical and physio- 
logical scores tend to bear a very low de- 
gree of relationship to reading achieve- 
ment so that Cov RM is large in relation 
to Cov RH + Cov RW + Cov RD. That 
is, the addition of H, W, and D to M does 
not supplement Cov RM to any marked 
extent, so that Cov RM accounts for most 
of Cov RO. 
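Next note that

$$\begin{aligned}
\operatorname{Var} O &= \operatorname{Var}(M + H + W + D) \\
&= \operatorname{Var} M + \operatorname{Var} H + \operatorname{Var} W + \operatorname{Var} D \\
&\quad + 2\operatorname{Cov} MH + 2\operatorname{Cov} MW + 2\operatorname{Cov} MD \\
&\quad + 2\operatorname{Cov} HW + 2\operatorname{Cov} HD + 2\operatorname{Cov} WD.
\end{aligned} \qquad [2]$$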


It is clear from [2] that even if the 
covariances involving M with H, W, and 
D are small, the variance of M is a rela- 
tively small portion of the variance of O. 
Now the correlation between R and M is 
given by 


$$\frac{\operatorname{Cov} RM}{\sqrt{(\operatorname{Var} R)(\operatorname{Var} M)}}, \qquad [3]$$


while the correlation between R and O is 
given by 


$$\frac{\operatorname{Cov} RO}{\sqrt{(\operatorname{Var} R)(\operatorname{Var} O)}}. \qquad [4]$$


As we have indicated, the numerator of 
[3] differs little from that of [4], while 
the denominator of [4] is much greater 
than that of [3] due to the fact that Var 
O must necessarily be much greater than 
Var M. Hence it is a simple mathematical 
fact that the correlation between R and O 
must be less than that between R and M. 
It, of course, follows in general that 
mental age alone is a much more useful 
predictor of school achievement than is
mental age in equally weighted combina- 
tion with various anatomical and physio- 
logical age scores, that is, OA. Moreover, 
the more physiological and anatomical 
age scores used in determining OA, the 
greater the attenuation of the correlation 
between OA and a school achievement 
criterion. 
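
The argument can be illustrated with simulated scores. The sketch below is not based on the data of the previous article: the means, standard deviations, and the strength of the relation between R and M are invented, and H, W, and D are generated independently of R so that they add variance to O without adding covariance with R. It checks the decomposition of Cov RO into its four parts and compares the correlation of R with M to the correlation of R with O.

```python
import math
import random


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def corr(x, y):
    """Product-moment correlation: Cov xy / sqrt(Var x * Var y)."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
n = 2000

# Hypothetical age scores in months; all parameters are assumptions.
M = [rng.gauss(120, 15) for _ in range(n)]
H = [rng.gauss(120, 20) for _ in range(n)]
W = [rng.gauss(120, 30) for _ in range(n)]
D = [rng.gauss(120, 20) for _ in range(n)]

# Reading depends on mental age only, plus random error.
R = [0.6 * (m - 120) / 15 + rng.gauss(0, 0.8) for m in M]

# Organismic age taken as the simple sum of the four age scores.
O = [m + h + w + d for m, h, w, d in zip(M, H, W, D)]

cov_RO = cov(R, O)
cov_parts = cov(R, M) + cov(R, H) + cov(R, W) + cov(R, D)
r_RM = corr(R, M)
r_RO = corr(R, O)

print(f"Cov RO = {cov_RO:.3f}; sum of the four covariances = {cov_parts:.3f}")
print(f"r(R, M) = {r_RM:.3f}; r(R, O) = {r_RO:.3f}")
```

With these invented parameters the numerator barely changes when H, W, and D are added, while Var O is several times Var M, so the correlation with O falls well below the correlation with M.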

To show how marked this attenuation 
actually becomes, Formulas [1], [2], and 
[4] were applied to the products matrix 
used in the least squares analysis reported 
in our previous paper. In this paper the 
correlation between R and M was reported 
as .645. When H, W, and D are added to 
M to form O, the correlation of R with 
O is only .24. With arithmetic achieve- 
ment as the criterion, the correlation be- 
tween A and M previously reported was 
.551. In this case when H, W, and D are
added to M to form O, the correlation of
A with O is .21.

In brief, there are neither theoretical 
nor empirical bases for believing that 
organismic age predicts school achieve- 
ment. This is not to say that OA may 
not be useful in predicting other types of 
behavior. However, evidence of such use- 
fulness is not as yet generally available. 